ABSTRACT


The purpose of this thesis is to analyze the operation
of a distributed database management system under network
partitions, to review a number of existing methods proposed
to deal with this problem, and to present an alternate
approach that allows multiple operating partitions upon
network partitioning.

When a network that supports a distributed database with
redundant data becomes partitioned, each partition may
function separately. Due to this, independent updates at each
partition may cause inconsistencies to arise. At network
reconnection time such divergent data, in particular copies
of the same data in different partitions, have to be
reconciled. There is no known general method for doing so.
Existing solutions are often unacceptable because system
availability is reduced. Two recently proposed methods that
allow continuous operation of multiple partitions may work
for certain applications but are not general enough.

In the past decade there has been considerable work
done on multiprocessor systems and computer networks. As a
consequence of this work the concept of distributed
computing systems was developed and is presently a focus of
intensive research in academia and industry.

In particular, Distributed Data Base Systems (DDBS) have
become one of the more important research topics since many
distributed systems are now being developed to provide users
with convenient access to data via some kind of
communications network.

A distributed database system has the potential
advantages of greater data availability and reliability since
data-items may be replicated and accessed at several sites
throughout the system. We use the term "potential" because
availability should increase with the number of copies of
the data. If the multiple copies of data were read-only
then availability would, in fact, be increased. However, when
updates are also allowed, multiple copies may provide no
improvement if mutual consistency among copies of the data
is emphasized.

Mutual consistency requires that, if all update activity
were to cease, then after some period of time all copies of
the same data will converge to the same value. There have
been many algorithms published for maintaining mutual
consistency during normal operation of a distributed
database [1], [5], [19], [18]. Unfortunately, these
algorithms do not consider mutual consistency in the face of
network partitioning.

A network partition occurs when two or more disjoint
subsets of sites in the network cannot exchange messages
through the network (i.e. cannot communicate with each
other) even though some or all of them are up and
operational. A special case of network partitioning occurs if
the only path between two or more sites is the
communications network. In this case a single site crash cannot be
distinguished from a network partition that separates that
site from the rest of the network.

Network partitioning can completely destroy mutual
consistency in the worst case, and so the usual solution to
this problem has been to restrict operation during a
network partition in such a way that only one group of sites
(i.e. those within one partition) is allowed to do the updates.
The basic idea behind this approach is that no update scheme
is effective against partitioning in guaranteeing mutual
consistency of data, unless data is always kept accessible
only in one partition [19], [18]. The methods proposed
vary in the way in which they select the set of sites
allowed to do the updates.

However, these kinds of schemes have as a major drawback
that it may be unacceptable for the non-selected sites to
shut down operations while the network is partitioned. We
must note that it is worthwhile to have all partitions in
operation if (1) availability is just as important as
consistency and (2) "conflicts" among copies of data can
always be successfully reconciled (either automatically by
the system, or by a user) when communications are
reestablished and the network returns to normal operation [16].

It is necessary to realize that network partitions are
not due exclusively to communications failures or site
crashes. Networks can be interrupted for tactical reasons
(as when a warship decrees radio silence to avoid enemy
detection of radio waves) or simply for economic reasons
(a corporation batches messages to be transmitted over
different periods of time to attain lower communications
costs).

The goal of this thesis is to analyze and evaluate some
of the proposed methods for dealing with the network
partitioning problem and to give some useful ideas towards the
solution of this problem, especially when availability is a
prime consideration in the design of the system.

Chapter 2 presents some basic concepts that will be
useful for a better understanding of the following
discussion and also presents some problems and issues that
arise when the network partitions. Chapter 3 presents a
survey of methods proposed to deal with network partitions,
placing special emphasis on two of them that allow non-stop
operation of partitions. In chapter 4 we present an
alternate approach to continuous operation of partitions based on
precedence graphs. We also present the algorithm required
to detect conflicts and reconcile the database at network
reconnection time.

II. THE NETWORK PARTITION PROBLEM


A.  INTRODUCTION 

The best way to describe the problem presented by
network partitions is by giving an example. Suppose we have
a network composed of three nodes A, B and C, and nodes A and
C have a copy of data-object X, which may contain, for
example, a record of the savings account of a certain person in
a bank. Suppose that communications are interrupted in
such a way that site A can communicate with site B but neither
of them can communicate with site C, thus dividing the
network into two partitions P1 (formed by nodes A and B) and
P2 (formed by node C). In this case both partitions P1 and
P2 have access to data-object X, but if we allow both
partitions to independently update data-object X, they may
perform inconsistent updates to it. This will happen due to
the impossibility of sending update messages through the
interrupted communications line.

Now, putting ourselves in the worst case, assume the
savings account of the person mentioned above has 10,000
dollars and the person is not very honest. When he learns
about the partition he goes to node A and withdraws all the
money in his savings account, and immediately goes to node C
and does the same operation.

Thus he will have 20,000 dollars in his hands and the
bank will have in each partition data-object X with the same
value of 0 dollars in the savings account record. When
communication is reestablished between the three nodes and
reconciliation is done, we will have a negative savings
account record for data-object X and the problem of having
to recover that money.

B. BASIC DEFINITIONS AND CONCEPTS

In this section we review the basic concepts that
will be needed in the rest of the thesis. A more complete
discussion of these concepts can be found in [7], [10],
[11].

A distributed database system is a collection of named
data-objects. Each object has a name and m values
associated with it, where m <= n and n is the number of sites in
the system. The sites are interconnected by a network and
each site has two software modules: a transaction manager
(TM), which supervises the execution of transactions, and a
data manager (DM), which processes read and write operations
on the data stored at the site.

A logical database is a set of logical data-objects. A
copy of a logical data-object stored at a site is called a
physical data-object. Logical data-objects will be denoted
by uppercase letters, e.g. X, and physical data-objects will
be denoted by lowercase letters, e.g. x1,...,xn. The set of
all physical data-objects stored at a site is called the
database of that site.

Operations on data are grouped into transactions. A
transaction is a program that accesses the database by
issuing read and write operations on logical data-objects.
In the read case its TM selects one copy of the data-object
and issues a read operation to the DM that manages that
object. In the write case the TM issues a write operation
for every physical copy of the logical data-object.
Transactions are the units of consistency and recovery.
They can be viewed as larger atomic actions on the system
state which transform it from one consistent state to a new
consistent state. Transactions preserve database
consistency because if some atomic action of a transaction (e.g.
a Read) fails then the entire transaction is undone,
returning the database to a consistent state.
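
The read-one, write-all behaviour of the TM can be sketched as follows. The class and method names are assumptions made for illustration, and the sketch leaves out failures, locking and atomicity.

```python
class DataManager:
    """Holds the physical data-objects stored at one site."""

    def __init__(self):
        self.store = {}

    def read(self, name):
        return self.store[name]

    def write(self, name, value):
        self.store[name] = value


class TransactionManager:
    """Maps each logical data-object to the DMs holding a physical copy."""

    def __init__(self, copies):
        self.copies = copies  # logical name -> list of DataManager

    def read(self, name):
        # Read case: select one copy (here simply the first) and read it.
        return self.copies[name][0].read(name)

    def write(self, name, value):
        # Write case: issue a write for every physical copy.
        for dm in self.copies[name]:
            dm.write(name, value)


dm_a, dm_c = DataManager(), DataManager()
tm = TransactionManager({"X": [dm_a, dm_c]})
tm.write("X", 10_000)
```

After the write, both physical copies of X hold the same value, which is the property a partition makes impossible to maintain across partitions.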

A transaction is made atomic by use of a commit
protocol. A commit is an unconditional guarantee to execute
the transaction to completion, even in the event of
failures. An abort is an unconditional guarantee to back out
the transaction. The problem of guaranteeing transaction
atomicity in a distributed system is that of ensuring that
all the sites either unanimously abort or unanimously
commit. After the commit the new value is made available to
all other transactions.
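
The unanimity requirement can be sketched as below. The vote-collection step and the names decide and complete are assumptions for illustration; the text does not prescribe a particular commit protocol, and the sketch ignores message loss and site failure.

```python
def decide(votes):
    """Commit only if every participating site voted to commit."""
    return "commit" if votes and all(votes.values()) else "abort"


def complete(databases, pending, votes):
    """Apply the pending updates at every site, or at none of them."""
    decision = decide(votes)
    if decision == "commit":
        for db in databases.values():
            db.update(pending)  # new values become visible everywhere
    return decision


databases = {"A": {"x": 1}, "C": {"x": 1}}
first = complete(databases, {"x": 2}, {"A": True, "C": True})
second = complete(databases, {"x": 3}, {"A": True, "C": False})
```

Here the first transaction commits at both sites and the second is aborted at both, so the copies never disagree.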


Concurrency control is the activity of coordinating
transactions that access a database concurrently. The goal
is to prevent concurrent transactions from interfering with
each other, so that every transaction sees a consistent
database state. Inconsistencies may arise because
transactions, which are the user's atomic operations, have a
coarser granularity than actions on objects, which are the
atomic operations directly supported by the underlying
system. If several transactions execute concurrently, their
actions get interleaved in an arbitrary way, allowing data
inconsistencies to arise. Concurrency control mechanisms
typically use locks to regulate access to shared resources.
The lock is a serialization mechanism which ensures that
only one transaction at a time is using a specific object.
The lock notifies other transactions that the object is
currently being used and protects the requestor from other
transactions trying to modify the object.

A formal definition of database consistency is based on
the notion of a serializable schedule. A schedule is any
sequence of actions performed by a set of transactions on
database objects. A schedule is serializable if it is
equivalent to a serial schedule, that is, to a schedule in
which transactions execute serially, one after the other
with no concurrency.

A schedule is consistent if and only if it is
serializable. Generally serializability is obtained by requiring
that each transaction in the schedule be two-phase and well
formed. A transaction is two-phase if it never locks any
data after releasing some lock. It is well formed if it
always locks in exclusive mode any data that it writes and
locks in shared mode any data that it reads. In order to
facilitate easy recovery it is required that all the locks
be released at the end of the transaction.
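
The two definitions can be checked mechanically on the action sequence of a single transaction. The action encoding below (pairs of an operation name and an object name) is an assumption chosen for this sketch, as is the choice to let an exclusive lock also cover reads.

```python
def is_two_phase(actions):
    """True if no lock is acquired after some lock has been released."""
    released = False
    for op, _ in actions:
        if op == "unlock":
            released = True
        elif op in ("lock_s", "lock_x") and released:
            return False
    return True


def is_well_formed(actions):
    """True if every read holds a lock and every write an exclusive lock."""
    held = {}  # object -> "s" (shared) or "x" (exclusive)
    for op, obj in actions:
        if op == "lock_s":
            held.setdefault(obj, "s")  # keeps "x" if already exclusive
        elif op == "lock_x":
            held[obj] = "x"
        elif op == "unlock":
            held.pop(obj, None)
        elif op == "read" and obj not in held:
            return False
        elif op == "write" and held.get(obj) != "x":
            return False
    return True


good = [("lock_s", "a"), ("read", "a"), ("lock_x", "b"),
        ("write", "b"), ("unlock", "a"), ("unlock", "b")]
not_two_phase = [("lock_s", "a"), ("read", "a"), ("unlock", "a"),
                 ("lock_x", "b"), ("write", "b"), ("unlock", "b")]
not_well_formed = [("lock_s", "a"), ("write", "a"), ("unlock", "a")]
```

The second sequence acquires a lock after releasing one, and the third writes under a shared lock only.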

A log (sometimes called audit trail or journal) is a
history of all the actions of transactions on recoverable
objects. Each action which modifies a recoverable object
writes a log record giving the old and new values of the
updated object. Read operations need generate no log
records, but update operations must record enough
information in the log so that given the record at a later time the
operation can be completely undone or redone. These records
will be aggregated by transaction and collected in a common
system log. The log is desirable because we want to be able
to commit or undo updates on a per-transaction basis without
affecting other transactions.
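
A minimal sketch of such a log follows. The record layout (transaction id, object, old value, new value) matches the description above, while the function names are illustrative assumptions. The sketch also assumes that every object exists before it is written, so that an old value is always available.

```python
log = []  # common system log: (transaction, object, old value, new value)


def logged_write(db, txn, obj, new_value):
    """Record old and new values before applying the update."""
    log.append((txn, obj, db.get(obj), new_value))
    db[obj] = new_value


def undo(db, txn):
    """Restore old values of one transaction, most recent record first."""
    for t, obj, old_value, _ in reversed(log):
        if t == txn:
            db[obj] = old_value


def redo(db, txn):
    """Reapply new values of one transaction in original order."""
    for t, obj, _, new_value in log:
        if t == txn:
            db[obj] = new_value


db = {"balance": 5000}
logged_write(db, "T1", "balance", 3000)
logged_write(db, "T2", "balance", 1000)
undo(db, "T2")  # backs out T2 only; T1's update is left in place
```

Undoing T2 returns the balance to the value T1 wrote, which is the per-transaction granularity the log is meant to provide.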

C. PROBLEMS AND ISSUES

In this section we present some problems and issues that
should be considered when dealing with the problem of
network partitioning.

1. Alternatives

When a network partition occurs we have three basic
alternatives:

(1) Halt all transaction processing in the partitions
until the network is completely reconnected again.

(2) Allow one partition to process transactions that
update data-objects while the rest may accept
read-only transactions. *

(3) Allow all partitions to continue operating "in
parallel" during partition and reconcile the databases
at partition merge.

We could consider two more alternatives: first, to delay all
transactions during the partition, and second, to execute
all transactions and then roll back the entire database,
reexecuting all transactions after the partition ends.
These alternatives are not considered because we would be
better off simply using alternative (1). Clearly
alternative (1) is not reasonable, since one of the
advantages of a distributed system is its increased
availability. Halting transaction processing in all partitions
would be contrary to the idea of having replicated data to
make data accessible after failures.


*The  user  should  receive  a  warning  which  alerts  him  of 
the  possibility  that  the  values  may  be  out  of  date. 


Alternative (2) seems more practical and is in fact
usually taken as a reasonable compromise. Most of the
methods proposed to deal with network partitions allow only
one group of sites to process transactions [1], [15], [9].
Allowing only one partition in operation facilitates
recovery after the partition, since to reconcile the
databases it is only required that sites in the non-active
partitions perform all the updates they missed. The only
problem in this approach is to guarantee that at most one
group of sites processes transactions. In chapter 3 we
review some of the methods proposed in order to achieve this
objective. However, these approaches may be unacceptable to
those sites that must remain non-active during partition
when availability is highly desired.

The third alternative, to allow all partitions to
process transactions, should be the goal of a distributed
system where availability is one of the primary concerns.
However, there are some serious problems in allowing
"parallel" operation of partitions. As each partition
processes different transactions and stores different values
into the databases, the values of the data-objects stored at
sites in different partitions will diverge and database
reconciliation is required when the network is reconnected.

In order to make the databases consistent after
partition we can use two strategies. The first strategy is
to undo transactions that made conflicting updates to
data-objects. For example, assuming two partitions, at
partition merge transactions in different partitions that
updated physical copies of the same data-object are detected
and some of them are undone. The value of the data-objects
in the partition where the transactions were undone is made
equal to the value updated by transactions that were left.
An important consideration is that each transaction that
read the values updated by a transaction that was undone
should also be undone. This requires a detailed log and the
necessary overhead to detect conflicting transactions. Also,
the users that executed transactions during partition will
not know if the values produced by their transactions are
valid or not until the partition is corrected.
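
The requirement that readers of undone values must also be undone can be sketched as a closure computation. The reads_from mapping is an assumed input that a detailed log would have to supply; it is not a structure defined in this thesis.

```python
def cascade_undo(undone, reads_from):
    """Return every transaction that must be undone.

    undone:     transactions initially chosen to be undone
    reads_from: transaction -> set of transactions whose values it read
    """
    result = set(undone)
    changed = True
    while changed:
        changed = False
        for txn, sources in reads_from.items():
            # A transaction that read from any undone one is undone too.
            if txn not in result and sources & result:
                result.add(txn)
                changed = True
    return result


reads_from = {"T2": {"T1"}, "T3": {"T2"}, "T4": set()}
to_undo = cascade_undo({"T1"}, reads_from)
```

In this example undoing T1 forces T2 and then T3 to be undone, while T4 is unaffected.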

The second way of achieving mutual consistency after
partition is to use semantic knowledge in order to
"integrate" the values of diverging data-objects [19]. This is a
very difficult problem and has been discussed in detail by
Faissol* in [8]. For example, an object r in an airline
reservation system indicates the number of available seats
in a flight. If after the partition the values of object r are
v1 and v2, then the correct value of r is given by v1 + v2
minus the value of r before the partition. Note that if
the reconciled value is negative then reservations will have
to be cancelled, with the consequent discomfort of some
affected customers. Obviously, special measures should be
taken in partitioned mode operation to avoid these problems.


*By the use of partitionable integrity assertions. This
is discussed in chapter 3.
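
The seat-count rule can be written out directly. The function name and the sample numbers are illustrative assumptions; the formula is the one given above.

```python
def reconcile_seats(before, v1, v2):
    """Available seats after merge: v1 + v2 minus the pre-partition value."""
    return v1 + v2 - before


# 10 seats before the partition; one partition sells 4, the other sells 3.
normal = reconcile_seats(before=10, v1=6, v2=7)

# One partition sells 7 and the other 6: the flight is overbooked.
overbooked = reconcile_seats(before=10, v1=3, v2=4)

print(normal, overbooked)
# prints: 3 -3
```

A negative result gives the number of reservations that would have to be cancelled.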

As we can see, we have to pay a high price in order
to assure increased availability of data during network
partition. However, there are some circumstances that
lessen the overhead which would otherwise be incurred in
detecting and solving conflicts. For example, it has been
pointed out in [5,6] that in a large class of applications
most transactions require little or no synchronization at
all because they will never interfere with each other.

2. Correct Operation Under Network Partitions

In order to provide correct operation of a
distributed database under a network partition there are three
aspects that should be observed:

(1) Preservation of mutual consistency.

(2) Compliance with integrity constraints.

(3) Control of external actions.

Within each partition mutual consistency between
copies of data-objects at different sites is preserved using
concurrency control methods in the same way as it would be
in a connected network. Therefore, each copy of the
database in different partitions is internally consistent.
However, since there is no communication between partitions,
the transactions in each partition will run without
coordination between them and we may end up with different values
of the modified objects. That is, if the same logical3
data-object is modified by transactions in different
partitions, then we will have a globally inconsistent state.
Note that even if the value of the same data-object in two
partitions is equal we cannot assume that the correct value
is the value stored at both partitions. For example, if our
bank account balance is 5000 dollars in each partition and
it is debited equally with 2000 dollars, we will find at
partition merge that both balances are 3000 dollars.
However, this is not the correct value, since if both
transactions had been executed with a connected network
the final value of the account balance would be 1000
dollars.

Assuming we do not know anything about the
semantics* of updates applied to data-objects, we can solve the
inconsistency that arose in the example above at partition
merge by first detecting the conflicting transactions in
both partitions and second reconciling the two copies of
the data-object. Reconciliation will require that one of
the transactions be backed out, that the update of
the remaining transaction be forwarded to the other partition,
and finally that the backed out transaction be executed in
both partitions. Figure 2.1 shows the process.


3See  definition  in  section  B. 

*As is the case with the method we develop in chapter 4.

    Partition (P1)                    Partition (P2)

    T1 reads balance = 5000           T2 reads balance = 5000
    T1 writes balance = 3000          T2 writes balance = 3000
            |                                 |
            |                                 |
            +------> Partition Merge <--------+

                   Conflict is detected
                            |
                            |
                            v

            Back-out T2 from P2
            T1 writes balance = 3000 in P2
            T2 reads balance = 3000 in P1 and P2
            T2 writes balance = 1000 in P1 and P2


    Figure 2.1  Restoring Mutual Consistency



Of course, it is not always this simple, and we will
have to make several considerations to restore mutual
consistency in more complicated cases. However, the main
idea behind this example is that it is possible to restore
mutual consistency in a database that has been independently
modified in different partitions, and so mutual consistency
can be preserved.
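
The back-out and re-execution steps of the bank example can be sketched as follows. The sketch assumes the conflict has already been detected and that P2's transactions are the ones chosen for back-out; the names debit and merge are illustrative, not part of the method developed later.

```python
def debit(amount):
    """Build a transaction that reads the balance and writes it reduced."""
    def txn(balance):
        return balance - amount
    return txn


def run(initial, txns):
    """Execute transactions serially, starting from an initial balance."""
    balance = initial
    for txn in txns:
        balance = txn(balance)
    return balance


def merge(initial, p1_txns, p2_txns):
    """Keep P1's updates, back out P2's, then re-execute P2's on top."""
    forwarded = run(initial, p1_txns)  # P1's result is forwarded to P2
    return run(forwarded, p2_txns)     # backed-out transactions run again


t1, t2 = debit(2000), debit(2000)
balance_p1 = run(5000, [t1])           # value in P1 during the partition
balance_p2 = run(5000, [t2])           # value in P2 during the partition
balance_merged = merge(5000, [t1], [t2])
```

Both partitions hold 3000 during the partition, and the merged value is 1000, the value a connected network would have produced.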

Integrity constraints have been classified in [2] as
operational constraints and semantic constraints.
Operational constraints are those related to the
preservation of database integrity against inconsistencies that
arise from the concurrent execution of several transactions.
As we have seen, the concurrency control mechanism
will assure that operational constraints are not violated.
Semantic constraints are those related to the preservation
of the database integrity against inconsistencies that arise
from violations of what data is supposed to mean. For
example, in a record for a course containing the fields Exam %,
Homework % and Labs %, indicating the percentage of the grade
devoted to each of them, we would expect that the sum of the
values of the fields is 100.
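
A semantic constraint of this kind is easy to state as a check. The field names below are assumptions for illustration, and the sketch takes the three fields to be percentages that should total 100.

```python
def weights_valid(record):
    """Semantic constraint: the grade percentages must total 100."""
    return record["exam"] + record["homework"] + record["labs"] == 100


consistent = {"exam": 50, "homework": 30, "labs": 20}
violating = {"exam": 50, "homework": 30, "labs": 30}
```

The concurrency control mechanism cannot detect the second record, since each field on its own is a legal value.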

Unless we use semantic knowledge to implement an
approach to continuous operation of partitions as in [8],
the requirements for compliance with operational and
semantic constraints are the same in each partition as the
ones in the completely connected network.

In addition to the database contents, external
actions may have been performed in response to a transaction
and some of these cannot be reversed. For example,
dispensing cash to a customer is in theory an irrecoverable
external action. Under a network partition the problem of
allowing external actions becomes more complex because of
the independent execution of transactions in different
partitions. External actions must be restricted when
operating in partitioned mode unless we can reverse the external
action by some kind of compensation. For example, a message
sent to a terminal must be followed by a validation note.
If the validation note is not received then the user will
know that the message received may not be valid and decide
what action to take. If we cannot give a compensation for
an external action then it should not be allowed. For
example, we should not allow cash dispensing since we cannot
compensate for it. However, one of the partitions may be
allowed to execute external actions provided that this
action is not repeated in other partitions at partition
merge.

A. INTRODUCTION

In this chapter we review previous methods proposed
to deal with network partitions. As we saw in chapter 2, the
most used alternative has been to allow only one partition to
update the database. All other partitions stop the
updating activity on their databases in order to facilitate
the database reconciliation at partition merge.

Section B presents a very brief review of some of these
methods. We do not spend too much time analyzing them
because availability is significantly restricted and we are
more interested in continuous operation of the different
partitions under partitioning. Also, none of these
approaches openly states how conflicting versions of
data-objects are detected or what is to be done with them upon
partition merge.

We are especially interested in high availability of
data, so the methods presented in section C, which allow
non-stop operation of partitions, will be presented in more
detail. They are two recently proposed methods. The first one
uses the version vector mechanism in order to detect file
conflicts. This approach is more suitable for an operating
system environment. The second approach is based on semantic
knowledge about operations on the data stored in the
distributed database.

B. APPROACHES INVOLVING ONE OPERATING PARTITION
1.  Voting 

There are some voting-based systems proposed in the
literature [15], [9]. There are two ways to implement
these systems. The first one is more suitable for fully
replicated databases. Here each site is assigned a weight
(a number of votes). When a partition occurs, the sites in
the partition with a majority of votes are allowed to
process the transactions (i.e. update data-objects). Sites
in other partitions go down or are allowed to process
read-only transactions. The advantage of this approach is that
if a user has access to a site which is up he has access to
the entire database. However, users that have access only to
down sites are restricted to reading the data.

The second implementation [9] is more general in the
sense that it does not require a fully replicated database.
The users desiring to modify an object must lock it by
obtaining a majority in a vote. That is, updates are only
allowed if a majority of sites vote to allow the update.
Since there can be at most one partition containing a
majority of sites, any object will be updated in at most one
partition. However, it may happen that there will be no
partition which contains a majority of sites, so in this case
updates are not allowed in any partition.
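
The majority test behind both implementations can be sketched as below. The function name has_majority and the vote table are assumptions; with every weight equal to one the weighted test reduces to a majority of sites.

```python
def has_majority(partition, votes):
    """True if the sites in the partition hold more than half of all votes."""
    total = sum(votes.values())
    held = sum(votes[site] for site in partition)
    return 2 * held > total


three_sites = {"A": 1, "B": 1, "C": 1}
p1_may_update = has_majority({"A", "B"}, three_sites)
p2_may_update = has_majority({"C"}, three_sites)

# An even split: neither half holds a strict majority.
four_sites = {"A": 1, "B": 1, "C": 1, "D": 1}
left_may_update = has_majority({"A", "B"}, four_sites)
right_may_update = has_majority({"C", "D"}, four_sites)
```

Because the test is strict, two disjoint partitions can never both pass it, and an even split leaves no partition able to update.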

Consistency is easy to preserve since at partition
merge minority sites would receive the missed updates and
apply them to their copies in timestamp order. This is a
clear example where mutual consistency is guaranteed at the
expense of availability. A disadvantage of voting in
general is that it may be unacceptable for the minority
sites to be prevented from operating during a partition.

2.  Tokens 

In this approach it is assumed that each data-object
has a token which can be passed from copy to copy. Only
sites in the partition containing the token are permitted to
modify the object. In other words, if the token for every
data-object accessed by a transaction resides at some site
in a given partition then the transaction may be executed in
that group, so using tokens might be less restrictive than
using voting.

This approach seems to be best suited for a file
system, where transactions access a single data-object. The
problem of having transactions accessing more than one
data-object is that there may be transactions that cannot be
executed at any site since the necessary tokens are in
different partitions. Another disadvantage is that tokens
can be lost (e.g. in a hard crash), and the problem of
recreating tokens is nontrivial. Furthermore, there is a
danger of making a resource unavailable if, when the
partition occurs, the token was in a very rarely used part of the
network.
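
The token rule can be sketched as a simple test. The mapping token_site (data-object to the site currently holding its token) and the function name are assumptions for illustration.

```python
def can_execute(objects, partition, token_site):
    """True if the token of every accessed object lies in the partition."""
    return all(token_site[obj] in partition for obj in objects)


token_site = {"x": "A", "y": "C"}  # which site holds each object's token
p1, p2 = {"A", "B"}, {"C"}

single_in_p1 = can_execute({"x"}, p1, token_site)
single_in_p2 = can_execute({"y"}, p2, token_site)
both_in_p1 = can_execute({"x", "y"}, p1, token_site)
both_in_p2 = can_execute({"x", "y"}, p2, token_site)
```

Single-object transactions can run in the partition holding the token, but a transaction on both x and y cannot run anywhere while the tokens are separated.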

3. Primary Sites

This method was originally proposed in [1]. In this
approach each data-object has a primary site, which is a
single site that is appointed responsible for an
object's activities. Transactions are executed in a
partition if it contains the primary sites of all the objects in
the read and write sets of the transaction.

This approach may provide better availability
(similar to the token approach) than the voting scheme, but
it also suffers from the same problems with respect to sites in
other partitions, that is, it may be unacceptable for them
to operate without updates.

Note that the idea of primary sites and tokens is
the same, but in this case the "token" cannot move around
and thus cannot be lost. However, the token approach offers
more flexibility because the "primary site" may vary
dynamically as required. Another disadvantage of the primary sites
approach is that, upon partitioning, if a primary site was
involved in a site crash and a backup site is elected as the
new primary site, then consistency problems can arise since
the information stored in the original primary site is not
available to the backup site (e.g. recent updates).

4. Reliable Networks


This approach was adopted in SDD-1 [5]. In this
system all possible transactions are divided into classes
with variable synchronization levels. The classification is
made a priori and requires knowledge of the allowed
operations and their semantics. Conflicts of transactions
of the same class are avoided by a technique called
pipelining, based on the assumption that in most applications the
operations in a database are known a priori and that most of
them do not conflict. What pipelining does is to allow only
one transaction of each class to execute at a time in a
global timestamp order.

I  Coaaunications  in  the  SDD-1  are  based  on  the  use  of 

a  "reliable  network"  [12],  which  guarantees  that  aessages 
are  going  to  be  delivered  eventually,  even  when  a  partition 
!  occurs.  Messages  are  saved  in  "spoolers"  to  be  transaitted 

following  a  break  in  coaaunications.  In  the  case  of  a 
partitioned  network,  non  conflicting  classes  can  clearly 

i 

operate,  but  the  solution  for  conflicts  within  classes 
clearly  can’t  be  iapleaented  due  to  the  lack  of  coaaunica- 
tion  anong  partitions,  which  prevents  the  exchange  of 
aessages  necessary  to  pipeline  the  transactions.  Thus  no 
guarantee  of  post-partition  consistency  exists  because 
nothing  is  done  to  prevent  conflicts  between  transactions 
when  the  partitions  aerge. 
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C.  1PPB01CHBS  IIT0L7 IBS  MULTIPLE  OPEBITIIG  PABflTIOIS 


i-  latsifltt  issisi 

This approach was first presented in [16] and was
used in the design of LOCUS, a local network operating
system at UCLA. It was intended for automatic detection of
mutual inconsistency between files upon recovery from
network failures, and especially upon partition merge.
However, the results did not generalize to transactions that
accessed more than one file.

Parker and Ramos [17] extended the "version vector"
mechanism originally used to implement this approach so as
to detect inconsistency when more than one file is used by a
transaction. It is important to note that this approach is
intended primarily for an environment where file update
rates are moderate and conflicts occur only rarely. In this
subsection we give a detailed presentation of
this approach.

a.  Preliminary  Definitions 

In this subsection we present some definitions
which are required in order to understand this approach. An
origin point OP(f) of a file f is a globally unique
identifier* which is assigned to f when it is created. Although
f's name can change, the origin point remains an immutable
attribute. Note that two files based on a common one can
have the same origin point.

*For example the pair (time of creation, site of
creation).

A name conflict occurs when two or more files
from the various partitions have the same name but different
origin points. A version conflict occurs when two or more
files have the same origin point but different names and/or
different contents. A file conflict is detected after a
partition if either a name or a version conflict is
detected. To restore mutual consistency, file conflicts must
be reconciled so that file names again uniquely identify a
file.

A partition graph G(f) for a file f is a
directed acyclic graph (DAG) which is labelled as follows:
The source and sink nodes are labelled with the names of the
sites in the network that contain copies of file f. Each
node can only be labelled with site names appearing on its
ancestors, and each name in a node appears on exactly one of
its descendants.

A version vector for a file f is a sequence of n
pairs, where n is the number of sites that store f. The i-th
pair (Si: Vi) counts the number Vi of updates to f made at
site Si. A set of version vectors is compatible when one
vector is at least as large as any other vector in every
site component for which they have entries. A version vector
is an encoding of the partial order* describing the set of
updates made at various sites. Independent updates, leading
to incomparable versions in the partial order, have
incomparable vectors as a result. A set of vectors conflicts
when they are not compatible.

For example, suppose that file f is stored at
sites A and B. Initially the version vector associated with
f will be <A:0,B:0>; every time f is modified at one site the
version vector will change accordingly. If f is modified at
B then the new version vector will be <A:0,B:1>. The version
vectors <A:0,B:1> and <A:1,B:0> conflict because neither
vector dominates the other.
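
The compatibility test described above can be sketched in a few lines
of Python. This is only an illustration, not the LOCUS implementation:
the names `dominates` and `compatible` are hypothetical, and a site
missing from a vector is assumed to count as zero updates.

```python
from typing import Dict

# A version vector maps a site name to the number of updates made there.
VersionVector = Dict[str, int]


def dominates(u: VersionVector, v: VersionVector) -> bool:
    """True if u is at least as large as v in every site component.

    Assumption: a site with no entry in a vector counts as 0 updates.
    """
    sites = set(u) | set(v)
    return all(u.get(s, 0) >= v.get(s, 0) for s in sites)


def compatible(u: VersionVector, v: VersionVector) -> bool:
    """Two vectors are compatible when one of them dominates the other."""
    return dominates(u, v) or dominates(v, u)


initial = {"A": 0, "B": 0}
updated_at_b = {"A": 0, "B": 1}
updated_at_a = {"A": 1, "B": 0}

print(compatible(initial, updated_at_b))       # True
print(compatible(updated_at_b, updated_at_a))  # False
```

The second call reproduces the conflict of the example: each vector is
larger than the other in one component, so neither dominates.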

An execution graph G = G(T1,...,Tn) is a DAG
with nodes C0, L1, C1, ..., Ln, Cn, Ln+1, where Li is the lock and
Ci is the commit operation of transaction Ti, respectively. C0
initializes all files and Ln+1 reads all files.** The edges of
G are pairs (x, y) where either x = Li and y = Ci, or y reads
what x writes.

b.  Description  of  the  Approach 

In the case of multiple file conflicts, version
vectors alone are not sufficient to detect conflicts; thus
an additional mechanism is required in order to achieve this
goal. Conceptually, nonserializability can be detected by
means of a precedence graph. A precedence graph is composed
of an execution graph of a schedule of operations and all
edges formed by operations with intersecting read and write
sets. If a precedence graph is acyclic then the execution
graph within it is serializable.

*A partial order is a binary relation which is reflexive,
antisymmetric and transitive.

**These are dummy transactions used to give symmetry to
the graph.

A set of files S is put into conflict if there
exists a schedule of transactions T1,...,Tn whose execution
graph is not serializable and one or more files in S
are also in the readset of any of the transactions of the
schedule. If S is put into conflict then the version vector
sequences for the sets S1,...,Sn will be incompatible. Note
that the sets S1,...,Sn are the readsets of the schedule of
transactions T1,...,Tn.

With these concepts in mind, if we want to detect
file conflicts for f, we must check all transaction sets of
files S containing f for serializability errors. A way to
do this is to keep a log in which the readsets of all the
transactions that have been executed are stored. An operation
called extent(f) is defined to obtain the set of files
that are related to f through some of the readsets stored in
the log. In mathematical notation:

extent(f) = { g | (f, g) is in R+ }

where R relates two files that appear together in a readset
stored in the log, and R+ is its transitive closure. In plain
words, extent(f) is the set of files
that in one way or another are related to f in the readsets
of the transactions stored in the log. For example, if the log
contains:

Readset(T1) = [ f, f3, f4 ]
Readset(T2) = [ f3, f4, f5 ]
Readset(T3) = [ f1, f2 ]
Readset(T4) = [ f3, f6 ]

then extent(f) = [ f, f3, f4, f5, f6 ]. This is because
the transitive closure implies that, since f3 and f4 were
related to f in Readset(T1), any other file related
to f3 or f4 is also related to f, and so on.
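
As an illustration of the extent operation, the following Python
sketch groups files with a small union-find structure and reads the
extent of a file off its equivalence class. The class `UnionFind` and
the function `extent` are hypothetical names, and the fourth readset
is taken to be [f3, f6], the reading of the example under which f6
belongs to extent(f).

```python
from typing import Dict, Iterable, List, Set


class UnionFind:
    """Minimal union-find over file names (no rank heuristic, for brevity)."""

    def __init__(self) -> None:
        self.parent: Dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving: point x at its grandparent while walking up.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x: str, y: str) -> None:
        self.parent[self.find(x)] = self.find(y)


def extent(f: str, readsets: Iterable[List[str]]) -> Set[str]:
    """Files related to f, directly or transitively, through logged readsets."""
    uf = UnionFind()
    for readset in readsets:
        for g in readset:
            uf.union(readset[0], g)  # all files of one readset share a class
    root = uf.find(f)
    return {g for g in list(uf.parent) if uf.find(g) == root}


log = [
    ["f", "f3", "f4"],   # Readset(T1)
    ["f3", "f4", "f5"],  # Readset(T2)
    ["f1", "f2"],        # Readset(T3)
    ["f3", "f6"],        # Readset(T4), assumed reading of the example
]

print(sorted(extent("f", log)))   # ['f', 'f3', 'f4', 'f5', 'f6']
print(sorted(extent("f1", log)))  # ['f1', 'f2']
```

Because every file of a readset is merged into one class, two files
have the same extent exactly when they fall in the same class.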

Two important consequences of the extent definition
are that a file is put into conflict if and only if its
extent is put into conflict, and that extent divides the set
of all files into equivalence classes. In the example above
note that the extent of f5 is [ f, f3, f4, f5, f6 ] and thus
extent(f) = extent(f5).

The stored equivalence classes and
their version vectors are all that is needed in order to
detect multiple file conflicts. The stored set of classes
is called a log filter. The algorithm presented for
multi-file conflict detection is as follows (LF represents the
log filter):

(1) LF = null

(2) Repeat steps (3) to (4) each time a transaction
commits

(3) If the readset S of the transaction is contained in
some set S' in the LF then attach the version vector
sequence corresponding to the files in the readset of
the transaction, with null vectors as place holders

(4) If S is not already contained in LF, incorporate S
and its corresponding version vector sequence into LF
using the fast union-find algorithm

(5) To check if a file is in conflict, get the extent of the
file in the log filter and see if it has incompatible
version vector sequences. If it has, then return
conflict.

Note that instead of keeping a list of sequences
of version vectors for every update node in the system, log
filters are used to reduce the number of sequences of
version vectors the system needs to store as log
information. That is, in order to detect conflicts it is only
necessary to store those sequences which are not dominated by
any other sequence.

The conflict resolution policy presented by
Parker and Ramos is based on the notion of a transaction.
Any file update operation must be within the transaction
(between the begin and end statements). A get statement is
defined which informs the system about which files the user
plans to use. This get statement will check if all the
files given to it are consistent. If there exists a file in
conflict then the transaction is not executed. Each file
specified in the get statement is locally locked for the
duration of the transaction. When the transaction ends the
system updates the log filter and the updates are committed
simultaneously. If a file is found to conflict at this
moment then it should be rolled back. The transaction
completes and all its locks are released.

The proposed approach might be seen as similar to
"optimistic" concurrency control [4], [I1*], where
conflicts are detected during and/or after the execution of
the transactions. It could be used for partition handling with
these concurrency control mechanisms as follows. When
working in partitioned mode, the users are notified of
file conflicts whenever a transaction is started and the
partition is being merged or is already merged. Once a file
conflict is detected no updates are performed on that file
until it is reconciled.

2.  ssaaaiis  Safislaiaa 

This approach was developed by Faissol [8] and is by
far the most complete presentation of a method to deal with
the network partition problem. The approach is based on the
use of semantic knowledge about the applications in order to
allow updates in independent partitions. Database
operations are divided into classes of semantics in order to
reduce the amount of semantic information that must be
supplied to the DBMS. Each class shares a common merge
algorithm and information gathering routines.

a.  Semantics  of  Operations  and  Database 
Reconciliation 

Semantic information is supplied to the DBMS in
three forms:

(1) The class of semantics of each operation.

(2) A set of integrity assertions for each operation.

(3) The program code for each operation.

With this information, the DBMS will appropriately modify
the behavior of the user operations under a network
partition in a way that guarantees that reconciliation can be
made automatically upon partition merge. In this approach
the applications programmer is in charge of extending
the requirements for semantic integrity in a way that
allows partitioned mode operation. Semantic integrity is
provided by a mix of integrity assertions and strong data
types. Integrity assertions are used mainly for those
constraints that may vary with the occurrence of network
partitions. Two sets of these assertions are specified, one
for normal operation and one for partitioned mode. They
are automatically enforced by the DBMS depending on the
status of the system. Only when irrecoverable external
actions are involved is it necessary to restrict the user's
actions by having stricter integrity assertions.

Operations are divided into five classes of
semantics depending on their properties. The first class of
semantics (class A) involves operations defined as replace
value and update single objects, that is, those operations
that update an object without examining the database and
have no associated integrity assertions. Examples of
these operations are updates of names and addresses.

The second class of semantics (class B) involves operations that
are compressible, commutative, update single objects and
have partitionable integrity assertions. A set of
operations on an account is compressible since we can replace
several credits by a single credit that is equivalent to
them. Two operations are commutative if the order in which
they are executed can be changed, producing an equivalent
schedule. For example, credit and debit operations on an
account are commutative. An integrity assertion is defined
as partitionable if we can derive from it a set of integrity
assertions, one for each partition, such that if each
assertion is satisfied in its respective partition then the
original assertion is satisfied at partition merge.

The third class of semantics (class C) involves operations that
either are commutative and invertible or are commutative and
have partitionable integrity assertions. An operation is
invertible if there exists another operation which will
restore the database to the initial state, that is, to the
value it had before the execution of the first operation.
For example, the debit operation can be inverted if it is
followed by a credit for the same amount. Note that if
operation Oi is invertible by operation Oj and is
commutative with all operations in a schedule, then it is invertible
even if there exist some operations between Oi and Oj in the
schedule.

The fourth class of semantics (class D) involves
operations that are invertible. Finally, the fifth class of
semantics (class E) involves operations that do not contain
irrecoverable external actions.

As we have seen, these semantic classes go from
the simplest operations to the most complex. An
important restriction that must be mentioned is that the
invertibility property of operations implies that no
irrecoverable external action may be allowed. In order to store
the information necessary to the reconciliation algorithm, a
history type is defined for each class of semantics. The
set of all history objects created in one partition is
defined as the partition history. A partition history will
in general contain objects of various classes. It is
created when the partition occurs and it is deleted when all
merges are complete. The first three history types are
stored as a set, that is, no ordering is defined. The last
two history types are stored as sequences. Figure 3.1
summarizes the class of semantics of each operation and the
information necessary to store in each partition history.

CLASS | TYPE OF OPERATION         | HISTORY REQUIRED
------+---------------------------+---------------------------
A     | replace value             | name of objects
      | updates single objects    |
------+---------------------------+---------------------------
B     | compressible              | names and initial
      | commutative               | values of modified
      | updates single objects    | objects
      | partitionable assertions  |
------+---------------------------+---------------------------
C1    | commutative               | operation-ID
      | partitionable assertions  | and parameters
------+---------------------------+---------------------------
C2    | commutative               | operation-ID
      | invertible                | and parameters
------+---------------------------+---------------------------
D     | invertible                | operation-ID and
      |                           | names in R/W set
------+---------------------------+---------------------------
E     | no irrecoverable          | operation-ID,
      | external actions          | parameters, names,
      |                           | values in R/W set

Figure 3.1 Semantic classes and histories


As we mentioned before, operations with class A
semantics involve the lowest overhead and are the most
restricted. Objects can be reconciled simply by choosing
the value in one partition and installing it in all others.
Note that only the last modification of each object is
required to be stored in the partition history for class A
operations.

Operations of class B also have little overhead.
Reconciliation of objects can be made independently
for each object by a single operation that summarizes all
the updates made in the other partition. Since it is not
required that they be invertible, they may contain
irrecoverable external actions.

Operations with class C semantics
can modify several objects at a time and therefore
reconciliation of the database is made on an operation by
operation basis. At partition merge, each operation that
ran in one partition is executed in all the others.
Operations can be executed in any order since they are
commutative. The subclass of semantics C1 allows
irrecoverable external actions, but subclass C2 is not allowed to
execute this kind of action since its operations are invertible. If
some integrity assertion is violated by an operation of
class C2 then some operations are inverted until a
consistent database is obtained.

Operations with class D
semantics are more complex since they must be executed in
order in all the other partitions at partition merge. In
this case conflicts may arise because of integrity
assertion violations or because operations that involve the same
data-objects were executed in different orders in each
partition. To reconcile the database, conflicting operations
must be inverted, taking care to invert also the operations
that read values produced by inverted operations, and then
reexecuting these operations in all partitions. Clearly the
partition merge algorithm is more complex.

Operations with
class E semantics include all operations except those with
irrecoverable external actions combined with
nonpartitionable integrity assertions. To reconcile the database
modified by operations in this class it is necessary to undo
these operations by restoring the "before" images of all
modified objects, taking the same precautions as with
operations of class D semantics.

It is important to be aware that there exist
some types of objects, called "critical types" by Faissol,
which cannot be handled by this approach to partitioned mode
operation, for example a bank stop payment order. Failure
to handle this kind of case automatically does not
invalidate the method, since such cases are infrequent enough to be
handled by extraordinary means (e.g. by telephone).
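
To make the two cheapest merges concrete, here is a hedged Python
sketch for exactly two partitions. It is an illustration under stated
assumptions, not Faissol's algorithm: the function names are
hypothetical, class B objects are assumed to be numeric values changed
only by additive credits and debits, and for class A the value of
partition 1 is arbitrarily preferred when both partitions modified the
same object.

```python
from typing import Dict, Set

Database = Dict[str, object]


def merge_class_a(db1: Database, db2: Database,
                  history1: Set[str], history2: Set[str]) -> None:
    """Class A: pick one partition's value and install it in the other.

    history1/history2 hold the names of objects modified in each partition.
    Assumed policy: partition 1 wins when both modified the same object.
    """
    for name in history2 - history1:
        db1[name] = db2[name]
    for name in history1:
        db2[name] = db1[name]


def merge_class_b(db1: Dict[str, int], db2: Dict[str, int],
                  history1: Dict[str, int], history2: Dict[str, int]) -> None:
    """Class B: apply one summarizing update per object.

    history1/history2 map each modified object to its initial value.
    Assumption: updates are additive, so the net effect of a partition
    is (current value - initial value).
    """
    for name in set(history1) | set(history2):
        initial = history1[name] if name in history1 else history2[name]
        merged = initial + (db1[name] - initial) + (db2[name] - initial)
        db1[name] = merged
        db2[name] = merged


# Partition 1 credited 50, partition 2 debited 30, both starting from 100.
p1, p2 = {"balance": 150}, {"balance": 70}
merge_class_b(p1, p2, {"balance": 100}, {"balance": 100})
print(p1, p2)  # {'balance': 120} {'balance': 120}
```

The class B merge relies on compressibility and commutativity: the
updates of each partition collapse into one net change, and the two net
changes can be applied in either order.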
b.  System  Operation 

This approach assumes that a concurrency control
mechanism exists in each partition to handle concurrent
execution of transactions. It is also assumed that a
recovery mechanism removes the effects of a system crash
from the database. System and applications software are not
directly available to the users of the database system, who
interact through a set of pre-defined transactions. System
operations are added to those supplied by the application in
order to enforce semantic integrity and to allow partitioned
mode operation.

When the entire network is connected the
system is in normal operation and all the copies of the
replicated database are mutually consistent. Every time a
transaction is submitted, a STATUS operation checks a
PAST-FLAG object, which is a system defined type and contains
information about the state of the network. If the result
of the check is "network connected", then the user operation
is executed. It is followed by a check operation on each of
the integrity assertions. If all assertions are satisfied
then the irrecoverable external actions are started (if they
exist) and the transaction terminates. If the assertions
are not satisfied then the transaction is aborted. Note
that in normal operation the only additional overhead is
the status operation, since the assertion checks are always
required to maintain semantic integrity.

If status returns "partition merge
in progress" the DBMS must check if the operation can be
executed. This depends on the class of semantics to which
the operation belongs. For example, operations with class A
semantics have to check that the target object is not locked
by the merge algorithm, while operations of class E
semantics have to check that their read and write sets do not
intersect with the read and write sets of the remaining
operations in each partition history, in order to be executed.

If status returns "network partitioned" the appropriate
information is stored in the partition history for the
class of semantics of the operation. If the operation is
not within one of the classes of semantics allowed to run in
partitioned mode then it is rejected; otherwise the operation is
executed and a check of integrity assertions is performed.
When two or more partitions merge, a system process performs
database reconciliation using the information stored in the
partition history for each class of semantics. A different
merge algorithm is invoked for each class of semantics.


In chapter 3 we reviewed previous work done on network
partitioning. We were particularly interested in analyzing
two recently proposed methods, by Parker [17] and Faissol [8],
because they allow non-stop operation of each partition and
reconcile the conflicts at partition merge time.

Since our objective is to attain high availability of
data in the distributed database, we will concern ourselves
in this chapter with the development of an alternate
approach that will also allow continuous operation of the
partitions during a network partition.

The approach proposed in this chapter relies on precedence
graphs in order to detect conflicts [20] and on
serializability as the correctness criterion for database
reconciliation.

In order to make the basic concepts more understandable, the
discussion that follows assumes that there are only two
partitions and that during partition merge all the
operations on the database are suspended. In later sections we
relax these constraints and present some extensions to the
approach.


B. DESCRIPTION OF THE APPROACH


1. Preliminary Definitions and Assumptions


It is assumed that within one partition there is a
mechanism to provide concurrency control and atomic
transactions. A number of such mechanisms have been described in
the literature [3], [11], [15], [18]. Therefore we
assume that the system operates as if only one transaction
is executed at a time and that rejected transactions have no
effect on the database. It is also assumed that if a system
crash occurs in the middle of a transaction, the recovery
mechanism will remove its effects from the database.

For the rest of the chapter transactions will have
the following structure:

(1) A transaction T wishing only to read a logical data
object X executes a Read-Lock X, which prevents any
other transaction from writing a new value of X while
T is reading. However, any number of transactions
can hold a read-lock on X at the same time.

(2) A transaction wishing to change the value of logical
data object X first obtains a write-lock for X, and
no other transaction can obtain either a read or
write-lock on the object.

(3) Messages are sent to all sites holding physical
copies of data-object X notifying them to change
their copies to reflect the new modification before
releasing the write-lock.

(4) The transaction commits and only then are all its locks
released (thus we assume two-phase locking to
assure serializable execution).

Definition 4.1: The set of logical data objects* that a
transaction reads is its readset. The set of logical data
objects that a transaction writes is called the transaction's
writeset. They will be represented as readset(T) and
writeset(T) respectively.

Note that in particular we do not assume that the
writeset of a transaction is always a subset of the readset.
This allows a more realistic model which admits the
possibility that a transaction reads a set of objects (the
readset) and writes a set of objects (the writeset), with
the option that an object X could appear in either one of
these sets or in both. For example, in the transaction:

READ X; READ Y; Z = X * X; X = Y; write Z; write X

the readset is X, Y and the writeset is X, Z.

Definition 4.2: A precedence graph G(V,E) is a directed
graph, where the vertices (V) correspond to the set of
transactions T1,...,Tk within a schedule S, and the
edges (E) represent precedence relations between the
transactions.

*See Chapter 2, section B.

Definition 4.3: A schedule S for a set of transactions
T1,...,Tk is serializable if its precedence graph is
acyclic.

Proposition 4.1: Two transactions Ti and Tj are commutative
if:

1) READSET(Ti) is disjoint with WRITESET(Tj) and

2) WRITESET(Ti) is disjoint with READSET(Tj) and

3) WRITESET(Ti) is disjoint with WRITESET(Tj)

Proof Outline: The only way in which transaction Ti may
affect the outcome of Tj is by modifying shared objects
in the database, and vice versa. Since only the readsets
are allowed to intersect and read operations are
commutative (the order in which transactions read a shared
object is unimportant), there is no real interaction
between the transactions. Therefore changing the order
of execution produces an equivalent schedule, which
implies commutativity.
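
The three conditions of Proposition 4.1 translate directly into a set
test. The sketch below is illustrative only; the `Transaction` record
and the function name `commutative` are hypothetical and carry just
the readset and writeset of each transaction.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Transaction:
    tid: str
    readset: FrozenSet[str]
    writeset: FrozenSet[str]


def commutative(ti: Transaction, tj: Transaction) -> bool:
    """Sufficient test of Proposition 4.1: only the readsets may intersect."""
    return (ti.readset.isdisjoint(tj.writeset)
            and ti.writeset.isdisjoint(tj.readset)
            and ti.writeset.isdisjoint(tj.writeset))


t1 = Transaction("T1", frozenset({"X", "Y"}), frozenset({"Z"}))
t2 = Transaction("T2", frozenset({"X"}), frozenset({"W"}))
t3 = Transaction("T3", frozenset({"Z"}), frozenset())

print(commutative(t1, t2))  # True: they share only the read of X
print(commutative(t1, t3))  # False: T3 reads Z, which T1 writes
```

Note that the test is only a sufficient condition: two transactions
that fail it may still commute for semantic reasons the sets cannot
express.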

Definition 4.4: Within one partition schedule we define a
transaction Ti to be a descendant of transaction Tj if
READSET(Ti) intersects WRITESET(Tj).

Definition 4.5: The relatives of a transaction T is the set
of all transactions that functionally depend on T (i.e.
the set of all descendants).
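
Definitions 4.4 and 4.5 can be illustrated with a short Python sketch.
Two points are interpretations rather than statements of the text: a
descendant is assumed to come later than its ancestor in the partition
schedule, and the relatives of T are assumed to include descendants of
descendants. The function name `relatives` is hypothetical.

```python
from typing import List, Set, Tuple

# (transaction-ID, readset, writeset), listed in schedule order.
Txn = Tuple[str, Set[str], Set[str]]


def relatives(schedule: List[Txn], tid: str) -> Set[str]:
    """IDs of all transactions that depend, directly or indirectly, on tid.

    Assumptions: only later transactions can be descendants, and the
    dependency is followed transitively.
    """
    index = {t[0]: i for i, t in enumerate(schedule)}
    found: Set[str] = set()
    frontier = [index[tid]]
    while frontier:
        j = frontier.pop()
        writes_j = schedule[j][2]
        for i in range(j + 1, len(schedule)):
            name_i, reads_i, _ = schedule[i]
            # Definition 4.4: READSET(Ti) intersects WRITESET(Tj).
            if name_i not in found and reads_i & writes_j:
                found.add(name_i)
                frontier.append(i)
    return found


schedule = [
    ("T1", set(), {"X"}),
    ("T2", {"X"}, {"Y"}),
    ("T3", {"Y"}, {"Z"}),
    ("T4", {"W"}, set()),
]

print(sorted(relatives(schedule, "T1")))  # ['T2', 'T3']
```

Here T3 never reads anything T1 wrote, yet it is counted as a relative
of T1 because it reads the value of Y that T2 derived from X.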

In order to verify the correctness of the approach
given in the next section we need a formal definition of
correct partitioned mode operation. We will adopt the
definition given by Faissol [8].

Definition 4.6: Let S0 be a schedule composed of
transactions T1...Tk, such that some transactions in S0
were successfully applied in partition 1 and the rest in
partition 2, resulting respectively in the schedules S1
and S2. Then correct partitioned mode operation is
attained if the following conditions are satisfied:

(1) With the information stored in each partition it is
possible to construct schedules S3 and S4 such that
schedules S5 and S6 are both equivalent to the same
serial execution of S0, where S5 = (S1, S3) and
S6 = (S2, S4).

(2) No transactions containing irrecoverable external
actions are reversed by the partition merge
algorithm.

(3) All integrity assertions are satisfied after the
partition merge algorithm is executed.

Note in particular that only a schedule equivalent
to some serial execution of S0 is required and not a
schedule equivalent to S0. This may cause different results
than would have occurred if the network was connected, but
this is usually accepted if serializability is the
correctness criterion.

Also it is important to note that, since some
transactions that would have been executed with the network
connected must be rejected in partitioned operation, S0 was
defined as the schedule of transactions successfully
executed in partitioned mode (this does not mean that all
transactions are committed, since some of them may be aborted
after execution because of violation of some integrity
assertion).

2.  caaiiisi  asissiiaa  sad  aaiafiasa  aasgasUlaiiga 

Our approach to continuous operation under a network
partition is based on the use of precedence graphs to detect
conflicts between partitions at merge time and to help
determine a serializable schedule equivalent to some serial
execution of the global schedule S0 defined in the last
section. When a network partition occurs the DBMS within
each partition performs two actions: first, it activates a
mechanism that aborts transactions trying to execute an
irrecoverable external action and second, it creates a
partition-log which stores information necessary for the
reconciliation algorithm. The information contained in the
partition-log consists of the transaction-ID, the read and write
sets of the transaction and the old and new values of the
updated objects (those in the write set). The transactions
are recorded in the order in which they commit⁹ within the
partition, that is, as a sequence (a total order). When
communication between partitions is reestablished no more
transactions are allowed to be processed until partition
merge is completed (this restriction is relaxed in section
C). The partition merge algorithm is then started to
reconcile the databases.

⁹ Note that a transaction can execute in a partition only if there
exist copies within the partition for every data-object in
its read and write sets.
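The partition-log described above can be sketched as follows. This is only an illustration under stated assumptions: the names LogEntry, PartitionLog, can_execute and record_commit are hypothetical and do not appear in the thesis, and data-objects are represented simply by their names.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    """One partition-log entry (field names are illustrative)."""
    txn_id: str
    read_set: frozenset
    write_set: frozenset
    old_values: dict  # "before" values of the objects in the write set
    new_values: dict  # "after" values of the objects in the write set


@dataclass
class PartitionLog:
    """Entries are kept in commit order, i.e. as a total order."""
    local_objects: set  # data-objects that have a copy in this partition
    entries: list = field(default_factory=list)

    def can_execute(self, read_set, write_set):
        # A transaction may run only if every object it reads or writes
        # has a copy inside this partition.
        return (set(read_set) | set(write_set)) <= self.local_objects

    def record_commit(self, entry):
        if not self.can_execute(entry.read_set, entry.write_set):
            raise ValueError(f"{entry.txn_id} touches objects with no local copy")
        self.entries.append(entry)
```

Appending at commit time is what makes the log a sequence; the later sketches in this section only need the transaction-ID and the read and write sets of each entry.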

Initially the partition reconciliation algorithm
will construct for each partition a precedence graph from
the information contained in the respective partition-log.
The precedence graph is constructed as follows:

(1) If transaction Ti reads data-object X, and Tj is the
next transaction (if it exists) to write X then
construct an edge from Ti to Tj.

(2) If transaction Ti writes data-object X and Tj is the
next transaction to write X then construct an edge
from Ti to Tj.

(3) If transaction Ti writes data-object X and Tj reads X
before any other transaction writes X then construct
an edge from Ti to Tj. Mark this edge as a descendant
edge.
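The three rules above can be sketched in Python as follows. This is an illustrative reading, not the thesis's own code: the function name build_precedence_graph and the representation of the log as (transaction-ID, read set, write set) tuples in commit order are assumptions, and an edge that qualifies under rule (3) is simply labelled "descendant" even if rule (1) or (2) also applies.

```python
def build_precedence_graph(log):
    """Return {(Ti, Tj): kind} for one partition.

    `log` is a list of (txn_id, read_set, write_set) tuples in commit
    order; kind is "descendant" for rule (3) edges, else "precedence".
    """
    edges = {}
    for i, (ti, reads_i, writes_i) in enumerate(log):
        later = log[i + 1:]
        # Rules (1) and (2): edge to the NEXT transaction that writes an
        # object which Ti read or wrote.
        for obj in reads_i | writes_i:
            for tj, _, writes_j in later:
                if obj in writes_j:
                    edges.setdefault((ti, tj), "precedence")
                    break
        # Rule (3): edge to every transaction that reads Ti's value of an
        # object before another transaction overwrites it.
        for obj in writes_i:
            for tj, reads_j, writes_j in later:
                if obj in reads_j:
                    edges[(ti, tj)] = "descendant"
                if obj in writes_j:
                    break  # Ti's value is no longer visible after this write
    return edges
```

Because every edge points from an earlier log entry to a later one, the resulting graph is acyclic, in line with the remark that each partition precedence graph is acyclic.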

It is important to note that the partition
precedence graph does not have to be constructed at partition
merge time, but can be constructed gradually as new entries
are added to the partition-log. In fact, it is better to do
it this way since at network reconnection the precedence
graph will be almost complete and partition merge time is
reduced. Also note that each partition precedence graph is
going to be acyclic since the schedules stored in the
partition-logs are serializable (they had already been
executed).

The next step is to construct a global precedence
graph which is going to consist of each partition's
precedence graph plus conflict edges between partitions. Since
transactions were allowed to run "in parallel" in their
respective partitions without coordination between them, the
conflict edges represent the interaction among transactions
from different partitions. Therefore, a transaction that
reads a data-object in one partition must precede any
transaction that writes that data-object in the other partition
to maintain consistency. So a conflict edge from Ti to Tj is
constructed if transaction Ti in one partition reads or
writes data-object X and transaction Tj in the other
partition writes X.
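A minimal sketch of the conflict-edge rule just stated, assuming the same hypothetical (transaction-ID, read set, write set) tuples for log entries; the function name conflict_edges is illustrative. Note that, read literally, the rule produces edges in both directions when two transactions in different partitions write the same object, which shows up as a two-node cycle in the global graph.

```python
def conflict_edges(log_a, log_b):
    """Return the set of conflict edges (Ti, Tj) between two partitions.

    An edge Ti -> Tj is added when Ti reads or writes some object X in
    one partition and Tj writes X in the other partition.
    """
    edges = set()
    for src_log, dst_log in ((log_a, log_b), (log_b, log_a)):
        for ti, reads_i, writes_i in src_log:
            for tj, _, writes_j in dst_log:
                if (reads_i | writes_i) & writes_j:
                    edges.add((ti, tj))
    return edges
```

The global precedence graph is then the union of the two partition graphs and this edge set.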

Once the global precedence graph is constructed a
topological sort is executed on the graph and if a cycle is
found, one of the transactions involved in the cycle (the
one with the fewest descendants) and all of its descendants are
rolled back in the partition where they were executed. The
entry in the partition-log corresponding to each rolled-back
transaction is sent to a re-execution list. If a node can
be extracted by the topological sort then the values of the
objects updated by the transaction represented by the node
are forwarded¹⁰ to the other partition and the corresponding
entry in the partition-log is deleted. The process is
repeated until all transactions in the precedence graph have
been forwarded to the other partition or sent to the
re-execution list, that is, until there are no more entries in the
partition-logs.

The transactions in the re-execution list are then
executed in both partitions and if any violation of
integrity assertions occurs the transaction is rolled back and
its entry in the re-execution list is deleted. This can
happen because we have altered the order in which
non-commutative transactions were executed. When the algorithm
terminates we are going to have a consistent database
throughout the network.

After  the  brief  discussion  of  the  approach  taken  we  are  now 
ready  to  present  the  merge  algorithm. 

Algorithm MERGE

(1) Send message "partition merge in progress" to each
partition.

(2) Construct the precedence graph for each partition
extracting information from their respective
partition-log.

(3) Repeat steps (4) to (5) for each partition.

(4) Repeat step (5) for each entry in the partition-log
starting at the first one.

(5) Compare the readset of this entry with the read and
write sets of every entry in the partition-log of the
other partition. Any time a match is found add a
directed edge (from the transaction in this entry to
the transaction in the other partition's entry) to the
global precedence graph. Mark this edge as a conflict
edge.

(6) Run the TOPOLOGICAL-EXEC algorithm on the global
precedence graph until all nodes on the graph have been
deleted.

(7) Execute algorithm RE.

(8) Send message "merge completed" to each partition.

(9) Terminate.

¹⁰ Actually, only the updated values of copies of
data-objects that also exist in the other partition are
forwarded. We will refer to this every time the word
forwarded is used.

At the end of the merge algorithm the global precedence
graph and all entries in both partition-logs will have been
deleted. We now present the supporting algorithms
TOPOLOGICAL-EXEC and RE. What the TOPOLOGICAL-EXEC
algorithm basically does is a topological sort on the global
precedence graph to obtain a serial schedule for the
transactions in both partitions. A topological sort generates a
linear ordering with the property that if Ti is a
predecessor of Tj in the graph then Ti precedes Tj in the linear
order. A linear order with this property is called a
topological order [13].

Since a linear order is serial by nature a
topological order gives a serial order which satisfies the
precedence relations between transactions.

It is important to note, however, that a topological
order can be obtained only if the global precedence graph is
acyclic and thus if there is a cycle it must be removed from
the graph before the topological sort can continue.
Algorithm TOPOLOGICAL-EXEC uses algorithm REMOVE-CYCLE in
order to remove one of the edges that form the cycle to
obtain an acyclic graph. Every time the TOPOLOGICAL-EXEC
algorithm is able to extract a node from the graph it
forwards the updates made by the transaction (contained in
the partition-log entry) that corresponds to the graph node
to the other partition. We now present the algorithm.

Algorithm TOPOLOGICAL-EXEC

(1) Repeat steps (2) to (6) for each node in the global
precedence graph.

(2) If every node has a predecessor then execute algorithm
REMOVE-CYCLE and go to (1).

(3) Pick a node which has no predecessors.

(4) Forward the updated values of the data-objects
modified by the transaction (contained in the
partition-log entry) that corresponds to the selected
node to the other partition.

(5) Delete the entry from the respective partition-log.

(6) Delete the node and all edges leading out of the node
from the global precedence graph.

(7) Terminate.
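The steps above can be sketched as a loop that repeatedly extracts a node with no predecessors. This is a hedged illustration: the function name topological_exec and the callbacks forward and remove_cycle are hypothetical stand-ins for step (4) and for algorithm REMOVE-CYCLE, the graph is assumed to be given as a set of node names and a set of (source, destination) pairs, and ties are broken alphabetically only to make the sketch deterministic (the algorithm itself allows any node without predecessors).

```python
def topological_exec(nodes, edges, forward, remove_cycle):
    """Return the order in which transactions were forwarded.

    forward(node)              -- stand-in for steps (4) and (5)
    remove_cycle(nodes, edges) -- stand-in for REMOVE-CYCLE; must return
                                  the non-empty set of nodes it rolled back
    """
    nodes, edges = set(nodes), set(edges)
    order = []
    while nodes:
        # Step (3): nodes with no incoming edge have no predecessors.
        ready = sorted(n for n in nodes if not any(dst == n for _, dst in edges))
        if not ready:
            # Step (2): every remaining node has a predecessor, so a cycle exists.
            removed = set(remove_cycle(set(nodes), set(edges)))
            if not removed:
                raise RuntimeError("remove_cycle removed nothing")
            nodes -= removed
            edges = {(s, d) for s, d in edges if s not in removed and d not in removed}
            continue
        node = ready[0]
        forward(node)
        order.append(node)
        # Step (6): delete the node and the edges leading out of it.
        nodes.discard(node)
        edges = {(s, d) for s, d in edges if s != node}
    return order
```

Rescanning the edge set on every iteration keeps the sketch short; an implementation concerned with merge time would maintain in-degree counters instead.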

As we indicated before, any time a cycle is found
the algorithm REMOVE-CYCLE is invoked to remove one of the
nodes involved in the cycle. This means that the effects of
the transaction contained in the entry that corresponds to
the node and the effects of all the relatives of the
transaction must be removed from the database. In order to avoid
extensive roll-back of transactions as much as possible the
transaction chosen to be removed will be the one with the fewest
relatives. The transactions will be rolled back in inverse
order of execution and their entries in the partition-log
will be moved to a re-execution list to be executed again
later. We now present the algorithm.

Algorithm REMOVE-CYCLE

(1) Repeat step (2) for each node related to another by a
conflict edge in the precedence graph.

(2) Compute the number of relatives of the node by
counting the descendant edges that go out either of
the node or its descendants.

(3) Choose the node with the smallest number of relatives and
create a relative set containing all the relatives of
the node. If there is more than one node with the same
number then choose the one involved with more conflict
edges.

(4) Move the partition-log entry corresponding to the
selected node to a roll-back list and repeat step (5)
for each following entry until the relative set is
empty.

(5) If the entry corresponds to a node in the relative set
then move the entry to the roll-back list and delete
the node from the set.

(6) Repeat steps (7) to (9) for each entry in the
roll-back list starting with the last entry and going
backwards until the list is empty.

(7) Use the system-supplied UNDO operation to remove the
effects of the transaction corresponding to the entry
by placing the "before" values of the updated objects
in their corresponding partitions.

(8) Move the entry in the roll-back list to the
re-execution list.

(9) Delete the node that corresponds to the entry and all
edges to or from the node in the global graph.

(10) Terminate.
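Steps (1) to (6) can be sketched as follows. The helper names relatives, choose_victim and rollback_order are hypothetical, edges are assumed to be (source, destination) pairs, and the final alphabetical tie-break is an addition made only so the sketch is deterministic.

```python
def relatives(node, descendant_edges):
    """All nodes reachable from `node` through descendant edges."""
    seen, stack = set(), [node]
    while stack:
        cur = stack.pop()
        for src, dst in descendant_edges:
            if src == cur and dst != node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def choose_victim(conflict_edges, descendant_edges):
    """Steps (1)-(3): pick the node to roll back and its relative set."""
    candidates = {n for edge in conflict_edges for n in edge}

    def key(node):
        family = relatives(node, descendant_edges) | {node}
        # Step (2): descendant edges leaving the node or its descendants.
        n_rel = sum(1 for src, _ in descendant_edges if src in family)
        n_conf = sum(1 for edge in conflict_edges if node in edge)
        # Fewest relatives first; on a tie, more conflict edges first.
        return (n_rel, -n_conf, node)

    victim = min(candidates, key=key)
    return victim, relatives(victim, descendant_edges)


def rollback_order(victim, relative_set, commit_order):
    """Step (6): undo the victim and its relatives, last committed first."""
    doomed = relative_set | {victim}
    return [t for t in reversed(commit_order) if t in doomed]
```

Undoing in reverse commit order lets each UNDO restore a "before" value that is not itself the product of a transaction still awaiting roll-back.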

At the end of algorithm TOPOLOGICAL-EXEC there is a
re-execution list which contains all the transactions from
both partitions that were rolled back in order to maintain a
globally consistent database state. These transactions are to
be rerun in both partitions by the algorithm RE. In this
case integrity violations can occur since we have changed
the execution order of transactions that are noncommutative.

Note that when algorithm TOPOLOGICAL-EXEC terminates
the databases of each partition are in a consistent state,
that is, they are maintaining mutual consistency since the
same transactions have been executed in both partitions. We
now present algorithm RE.

Algorithm RE

(1) Repeat steps (2) to (4) for each entry in the
re-execution list.

(2) Run the specified transaction in both partitions.

(3) If any integrity assertion is violated reject the
transaction.

(4) Delete the current re-execution list entry.

(5) Terminate.
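A minimal sketch of algorithm RE under simplifying assumptions: each partition's database is a Python dict, a transaction is a hypothetical function that updates a dict and returns it, and integrity_ok is a hypothetical predicate standing in for the integrity assertions. The name algorithm_re is likewise illustrative.

```python
def algorithm_re(reexecution_list, databases, integrity_ok):
    """Rerun rolled-back transactions in every partition; return rejected IDs.

    reexecution_list -- list of (txn_id, apply_txn) pairs, consumed in order
    databases        -- one dict per partition, updated in place
    """
    rejected = []
    while reexecution_list:
        # Step (4): the entry leaves the list whether or not it succeeds.
        txn_id, apply_txn = reexecution_list.pop(0)
        # Step (2): run the transaction against a copy of each database.
        candidates = [apply_txn(dict(db)) for db in databases]
        if all(integrity_ok(c) for c in candidates):
            for db, cand in zip(databases, candidates):
                db.clear()
                db.update(cand)
        else:
            # Step (3): reject in all partitions so they stay identical.
            rejected.append(txn_id)
    return rejected


def withdraw(amount):
    """Illustrative transaction: subtract `amount` from the balance."""
    def apply(db):
        db["bal"] -= amount
        return db
    return apply
```

Running each transaction against a copy first means a rejected transaction leaves no trace, which plays the role of the roll-back mentioned in the text.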

We proceed now to show the correctness of the approach.

Proposition 4.2: Algorithm MERGE correctly reconciles a
database that has been independently modified by
transactions in different partitions.

Proof: Let S0 be the schedule in the whole system with S1
and S2 executed in partition 1 (PR1) and partition 2
(PR2) respectively. We must prove that each of the
requirements for correctness defined in section B,
subsection 1 is satisfied when algorithm MERGE is
executed. In order to make the proof more
understandable we consider three cases according to the initial
configuration of the global precedence graph
constructed by steps (1) to (5) of algorithm MERGE.


Case 1: Global precedence graph with no conflict
edges. This means that all transactions in one
partition are commutative with all transactions in the other
partition. Step (6) executes algorithm TOPOLOGICAL-EXEC
which will obtain a topological order of transactions
from both partitions and will forward the values of
objects updated by transactions in one partition to the
other. In this case the resulting schedules of
transactions executed in PR1 and PR2 will be equivalent to
the global schedule S0 and not only to a serial
execution of it. This is due to the fact that transactions
in PR1 are commutative¹¹ with transactions in PR2 and
vice versa and they can be executed in any order¹²
without affecting their results.

Case 2: Global precedence graph with conflict edges
but without cycles. In this case the graph is also
serializable. A conflict edge represents the fact that
for the same logical data-object with physical copies
in different partitions, a transaction in one partition
read the value of a copy of this data-object while in the
other partition a transaction updated the value of the
copies of the data-object.
Algorithm TOPOLOGICAL-EXEC by step (3) will make sure
that the transactions which read an object forward
their updated objects before transactions that write
the object in the other partition. The rest of the
transactions are commutative with the transactions in the
other partition so they are no problem.¹³ However, the
resulting schedule executed in PR1 and PR2 may not be
equivalent to S0 but only to a serial execution of it,
namely, the one produced by algorithm TOPOLOGICAL-EXEC.

¹¹ See definition of commutativity in section B.

¹² The order in which they were executed in their own
partition must be preserved.

Case 3: Global precedence graph with cycles. In this
case the graph is not serializable. Cycles must be
removed to obtain a serializable schedule. Step (2) of
algorithm TOPOLOGICAL-EXEC will detect cycles and
remove all offending transactions and their relatives
using algorithm REMOVE-CYCLE. Removed transactions are
sent to the re-execution list. Once the graph has no
cycles we are again in case 2. Values of objects
updated by transactions are forwarded to the
corresponding partition in topological order by step (4) of
algorithm TOPOLOGICAL-EXEC. Immediately before step (7)
of algorithm MERGE, the schedules in both partitions
are equivalent, with each transaction not removed from
the graph executed on both sides and the transactions
removed from it in the re-execution list. Step (7) will
then execute algorithm RE, which reruns the
transactions that were removed in both partitions in
the same order, checking integrity assertions. If some
integrity violation occurs at this point, the
transactions are aborted in both partitions. At this stage
every transaction in the global schedule S0 except
those with integrity violations has been executed in
both partitions. Thus, the resulting schedules are
equivalent to some serial execution of S0, namely, the
one produced by algorithm MERGE. Note that this
schedule will be composed of transactions executed in
topological order by algorithm TOPOLOGICAL-EXEC, plus
transactions rerun by algorithm RE minus transactions
with integrity conflicts.

Correctness condition (1) is
satisfied because in each of the three cases we are
going to have at least a schedule equivalent to some
serial execution of S0 in both partitions. Condition
(2) is satisfied because no irrecoverable external
actions are allowed. Condition (3) is also satisfied
because transactions that violate integrity assertions
are aborted.

¹³ Same as in case 1.

As we can see the merge algorithm is somewhat
complex. This is due to our interest in allowing more
availability of data and to our concern in trying to avoid as
much transaction roll-back as possible.


a. An Example

In order to make clear how the approach works we
present in this subsection an example.


Let S1 = T11, T12, T13 be the schedule of transactions
executed in partition 1 (PR1) where:

Readset (T11) = y        ;  Writeset (T11) = x,z
Readset (T12) = u,v,z,p  ;  Writeset (T12) = v
Readset (T13) = p,q      ;  Writeset (T13) = p,q

and S2 = T21, T22, T23, T24 the schedule executed in PR2
where:

Readset (T21) = q,u,r    ;  Writeset (T21) = u,r
Readset (T22) = l,m,n    ;  Writeset (T22) = l,m
Readset (T23) = u,w      ;  Writeset (T23) = w
Readset (T24) = x,y,z    ;  Writeset (T24) = y

When the partitions find out that they can communicate
algorithm MERGE is started. Step (2) of this algorithm will
construct the precedence graph of each partition with the
information stored in their respective partition-log. Steps
(3), (4), (5) of the algorithm will construct the global
precedence graph by adding the conflict edges to the
existing graph. Figure 4.1 shows the global precedence graph
constructed.

Figure 4.1 Initial precedence graph

Once the global precedence graph is constructed step
(6) will call algorithm TOPOLOGICAL-EXEC to obtain a
topological order of the nodes in the graph. Step (3) of algorithm
TOPOLOGICAL-EXEC will select the node corresponding to T22
since it is the only node without a predecessor. Step (4) of
this algorithm will forward the values of the objects updated
by T22 (i.e. l,m) to partition 1 (PR1). Steps (5) and (6)
will delete the entry in the partition-log that corresponds
to that node and will delete the node from the graph
respectively. Figure 4.2 shows the state of the graph after the
deletion.

Figure 4.2 First modification to the precedence graph

Step (2) of algorithm TOPOLOGICAL-EXEC will
determine that all remaining nodes have a predecessor so a cycle
exists. Algorithm REMOVE-CYCLE is then invoked and steps (1)
and (2) of this algorithm count the descendants of each node
related to another by conflict edges. T12 and T24 have the
same number of descendants (0 descendants) but T24 is
involved with two conflict edges so step (3) of this
algorithm chooses T24 to be rolled back and creates an empty set
of descendants. Step (4) moves the partition-log entry
corresponding to the selected node to the roll-back list and
step (5) is skipped since the node has no relatives. Steps
(7), (8), (9) remove the effects of the transaction from the
database by using the UNDO operation, move the entry
corresponding to T24 to the re-execution list, and delete the
node and edges to or from it from the graph respectively.
Figure 4.3 shows the new state of the precedence graph.

Figure 4.3 Second modification to the precedence graph

Algorithm TOPOLOGICAL-EXEC resumes execution and by step
(3) picks T11 since now it has no predecessors. Step (4)
forwards the values updated by T11 to PR2 and steps (5) and
(6) delete the entry corresponding to T11 and the node in
the graph respectively. Figure 4.4 shows the remaining
graph.

Figure 4.4 Third modification to the precedence graph

Successive applications of steps (3), (4), (5) and (6) pick
T12, T21, T23, T13 in that order (T23 and T13 could be
picked in any order), send the updates of the transactions
to the partition where they did not execute, and delete the
respective entries from the partition-log and nodes from the
graph. Figure 4.5 shows the next state of the precedence
graph.

Figure 4.5 Fourth modification to the precedence graph


Now algorithm MERGE resumes execution and by step
(7) it executes algorithm RE. The only entry in the
re-execution list was T24 so this transaction will be
executed in both partitions. If any integrity assertion is
violated the transaction will be rejected by step (3) of this
algorithm and by step (4) the entry is deleted from the
re-execution list.

Note that the schedule of transactions executed in
both partitions is now equivalent to a schedule executed in
topological order. This is due to the fact that sending the
updates made by any transaction in PR1 to PR2 is equivalent
to executing that transaction before any transaction in PR2
that writes those objects and vice versa. Thus the
equivalent topological order of execution in both partitions will
be: T22, T11, T12, T21, T23, T13, and T24 if it is not
aborted.

C. EXTENSIONS TO THE APPROACH

Section B presented the approach we proposed to allow
the operation of distributed database systems under network
partitions. In order to simplify the description of the
system's operation and of the merge algorithm, we made a
number of restrictions and promised to relax them later.
This is the purpose of this section.

Subsection 1 presents a discussion of the locking
requirements and associated modifications to the merge
algorithms that will allow normal operation while a partition
merge is in progress. The main objective of this subsection is
to reduce the delay to which incoming transactions would be
subjected while the merge algorithm is in progress. This
delay can be substantial for partitions of long duration or
for database systems with high activity rates.

Subsection 2 presents a discussion of partial partition
merges when there are more than two partitions. Sites may
become partitioned and then join again in various orders.
The easiest solution would be to wait until the network is
completely reconnected to perform partition merges. This
would not only increase the degree of inconsistency that
must be reconciled later but also would increase the
overhead and time required for partition merge.

Finally subsection 3 presents a discussion of the
situations in which irrecoverable external actions can be
allowed when the network is partitioned. The main objective of
this subsection is to increase the availability of the data
in the database to users, so that a higher number of
transactions can be executed during the partition.

1. Normal Operation During Merge

The merge algorithm described in section B,
subsection 2 assumed that no transaction was allowed until the
algorithm was completed. This assumption relieved us from
worrying about interference from other transactions and
locking issues in the description of the algorithm.

This subsection presents the locking requirements
necessary to allow normal operation while the merge
algorithm is in progress. We will see that these requirements
are relatively simple, despite the fact that the algorithm
is somewhat complex.

Normal operation during partition merge can be
allowed if the new transactions do not interfere with
transactions being reconciled by the merge algorithm. That is, we
need the new transactions to be commutative with all the
remaining transactions in each partition-log in order to
execute them in normal mode. Otherwise new conflicts will
arise that could not be resolved.

To assure that the new transaction is commutative we
need to compare its read and write sets with the read and
write sets of all transactions still in the partition-logs.
If no match is found then we know it is commutative and can
be executed without problem. However if there is a match
then the transaction will have objects in its read and write
sets that are yet to be reconciled and so it must be delayed
to avoid new conflicts.

Note that even if the new transaction is commutative
there may be a significant delay before it can be executed
since we are introducing additional overhead in order to
compare its read and write sets with the ones of the
transactions that remain to be reconciled.

However, we can attempt an optimization if we use
the idea of a data-object log (DO log) [4] to store
information about the status of the object. The information we
need to store is just a "mark" that indicates that the
object was used by some transaction in partitioned mode. In
order to accomplish this we need to establish the policy
that while in partitioned mode operation the first time an
object is used by any transaction a DO log is
created and a value (i.e. 0) is stored in it. After this,
every time a transaction uses the object the value in the DO
log is incremented (i.e. by 1). This process stops when the
partition merge algorithm initiates its execution. A small
modification should be made to the MERGE algorithm. Every
time an entry is deleted from a partition-log or from the
re-execution list, the value stored in the DO log of each
object used by the transaction that corresponds to that
entry is decremented by 1 in the partition where the
transaction executed while in partitioned mode. If when deleting
an entry the value stored in the DO log is 0 then the DO log
is deleted.

In that way new transactions operating in normal
mode that are willing to use an object just have to check in
each partition if the object has an associated DO log and if
so then the transaction is delayed until the DO log of the
object is deleted in every partition where it existed.

We can see that the overhead involved is much
smaller than the one we need to compare the read and write
sets with all of the transactions being reconciled and so
the delay should be considerably decreased. Also the
overhead imposed on the MERGE algorithm by this addition is
not very significant and the additional storage required is
rather small.
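The DO log bookkeeping can be sketched for a single partition as follows. The class and method names (DOLogTable, note_use, entry_deleted, must_delay) are hypothetical; the counting convention follows the text, where the first use stores 0, later uses add 1, and deleting an entry either decrements the value or, if it is already 0, deletes the DO log.

```python
class DOLogTable:
    """Per-partition DO logs, keyed by data-object name."""

    def __init__(self):
        self.counts = {}

    def note_use(self, objects):
        # Called for each transaction executed in partitioned mode.
        for obj in objects:
            if obj in self.counts:
                self.counts[obj] += 1
            else:
                self.counts[obj] = 0  # first use creates the DO log

    def entry_deleted(self, objects):
        # Called by MERGE when an entry leaves a partition-log or the
        # re-execution list; `objects` are those used by that entry.
        for obj in objects:
            if self.counts[obj] == 0:
                del self.counts[obj]  # last pending use: drop the DO log
            else:
                self.counts[obj] -= 1

    def must_delay(self, objects):
        # A new transaction waits while any object it uses has a DO log.
        return any(obj in self.counts for obj in objects)
```

A new transaction then needs one lookup per object instead of a comparison against every entry still in the partition-logs, which is the saving described above.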

2. Partial Partition Merges

In the presentation of our approach we assumed that
only two partitions existed. However, this may not be the
case and although it should be quite infrequent there may be
more than two partitions that can join in different orders
depending on which communication lines are reestablished
first. This section relaxes the assumption that only two
partitions exist and discusses how to deal with the problem
of partial partition merges.

As we mentioned before a straightforward, but
simple-minded, solution could be to wait until the network is
completely reconnected and then start the partition merge
algorithm involving all partitions at the same time. However
this solution has serious disadvantages such as having more
restricted operation within a partition for more time and
having a significantly increased overhead in order to merge
all the partitions at the same time.


An alternate solution could be to allow partial
partition merges to occur in different orders without any
restriction, as soon as two partitions discover that they
can communicate with each other. This can be done because we
have in the partition-log enough information (names and
values of objects in read and write sets) to avoid repeated
updates to the same object and out-of-order execution of
transactions in different partitions. However this will
require an extensive comparison of partition-logs and
additional lists to store the entries that were removed in a
previous merge in order to see if it is required to remove
these entries from a new partition joining the existing
partition.

An example can clarify these concepts. Suppose we
have the partition graph shown in figure 4.6. Initially the
network partitions forming two groups, the first group
contains only N3 and the second group N1 and N2. Each
partition is assigned a unique partition-ID which is
included in all entries made to their respective
partition-logs. Later, another partition occurs resulting in N1 and
N2 working separately.

Figure 4.6 Partial merges in a partition graph

Again a unique partition-ID is assigned to these
partitions and every entry to the partition-log from now on
will have the new partition-ID. Note that each of the newly
formed partitions N1 and N2 "inherits" the partition-log
entries of the past partitions, with these entries maintaining
their original partition-ID. When N2 discovers that it can
communicate with N3 they start the merge algorithm to
reconcile their databases. However, no entry of their
partition-logs can be deleted since they will be needed
in new merges to check whether the transactions corresponding to
those entries have been executed before, i.e. when N1 and N2
were in the same partition. We must also preserve entries
in the roll-back lists because if entries corresponding to
N1 when it was in the same partition with N2 are rolled
back, then when these two sites are reconnected again we
must roll back these entries from N2 also. Similar
precautions will have to be taken with entries in the re-execution
list, where some transactions that were originally executed
in (N1, N2) will not be able to be re-executed in (N2, N3)
if some object is present only in N1, so those transactions
should be delayed in their re-execution until N1 joins (N2,
N3). As we can see the required protocols to make possible
partial merges in arbitrary orders would be pretty involved
and the high overhead would make them impractical. Thus,
although it is possible to allow partial merges in different
orders we will not pursue this solution because of its
complexity.

A far more practical solution can be obtained if we restrict
partial merges in such a way as to allow only symmetric
merges, that is, if we require that the partition graph be a
symmetric directed acyclic graph. Figure 4.7 shows the way in
which merges would be executed if we comply with this
restriction. As we can see, the subgraphs are symmetrical, so
partitions merge in the same order in which they were
partitioned.
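
The restriction can be pictured as follows: two partitions may
merge only if they were produced by the same earlier split.
The sketch below is illustrative only. It assumes that
partitions are modeled as sets of site names and that every
split is binary; the names PartitionGraph, split, can_merge
and merge are hypothetical and do not come from the thesis.

```python
class PartitionGraph:
    """Tracks binary splits so that only symmetric merges are accepted."""

    def __init__(self, sites):
        self.live = {frozenset(sites)}  # partitions currently operating
        self.sibling = {}               # partition -> partner from the same split

    def split(self, whole, part_a, part_b):
        whole, a, b = frozenset(whole), frozenset(part_a), frozenset(part_b)
        if whole not in self.live or not a or not b or a & b or a | b != whole:
            raise ValueError("invalid split")
        self.live.remove(whole)
        self.live.update((a, b))
        self.sibling[a], self.sibling[b] = b, a

    def can_merge(self, a, b):
        a, b = frozenset(a), frozenset(b)
        # Symmetric merge: both sides are live and came from the same split.
        return a in self.live and b in self.live and self.sibling.get(a) == b

    def merge(self, a, b):
        if not self.can_merge(a, b):
            raise ValueError("merge would not be symmetric")
        a, b = frozenset(a), frozenset(b)
        self.live -= {a, b}
        del self.sibling[a], self.sibling[b]
        merged = a | b
        self.live.add(merged)
        return merged


# The situation of Figure 4.6: N3 splits off first, then N1 and N2 separate.
g = PartitionGraph({"N1", "N2", "N3"})
g.split({"N1", "N2", "N3"}, {"N3"}, {"N1", "N2"})
g.split({"N1", "N2"}, {"N1"}, {"N2"})
```

Under these assumptions a merge of N2 with N3 is refused,
while N1 and N2 may merge, after which (N1, N2) may merge with
N3.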

With this merge pattern, the only modification we need to
introduce in the MERGE algorithm is to retain the entries in
the new partition-log. Note that these entries will be stored
in topological order, that is, in the order in which the
topological-exec algorithm executed the transactions in both
partitions. Entries can be deleted when the sink node is
reached.


Figure 4.7 Symmetric partial merges in a partition graph


We now present the modifications required in order to allow
partial merges in a symmetric directed acyclic graph (DAG).
Recall that algorithm MERGE uses algorithm TOPOLOGICAL-EXEC to
delete the entries from the partition-logs, so the
modification will be made in this algorithm.

Modification to algorithm TOPOLOGICAL-EXEC.

(5) If the sink node in the partition graph has been
    reached then delete the entry from the respective
    partition-log, else store the entry in the new
    partition-log.

We can now proceed to show the correctness of the
modification made to the algorithm.

Proposition 4.4: If the partition graph is a symmetric DAG
then partial partition merges are correctly executed by the
modified MERGE algorithm.

Proof outline: Since the partition graph is a symmetric DAG,
each subgraph represents a situation which is the same as
the one for the original MERGE algorithm. That is, each
subgraph consists of a subsource node and a subsink node
that contain the same sites, and thus no duplicate updates
or out-of-order execution of transactions may occur. With
the modification in step (5) of algorithm TOPOLOGICAL-EXEC,
partition-log entries are retained after they have been
used to reconcile their respective databases. Thus, these
entries will be available to be applied in the other
subgraphs. There is no problem with transactions that are
undone, since they will have been executed only in the
current subgraph, and since no other subgraph was involved
they do not need to be undone elsewhere. The same is true
for transactions that violate integrity assertions when
being re-executed. Thus we have no problem in deleting the
corresponding entries from the partition-log. Once the sink
node of the entire graph is reached, the entire network is
reconnected and we have the same situation as in the
original MERGE algorithm. When this occurs, the entries in
the partition-logs can be deleted, since they will have
been applied to all sites.
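
The modified step (5) can be sketched as below. This is a
minimal illustration, not the thesis algorithm itself: it
assumes that the entries have already been processed in
topological order and that the caller knows whether the sink
node of the whole partition graph has been reached. The names
LogEntry and step5 are hypothetical.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    txn_id: str
    partition_id: str  # partition in which the transaction originally executed


def step5(ordered_entries, sink_reached):
    """Return the partition-log carried forward by the merged partition."""
    new_partition_log = []
    for entry in ordered_entries:        # assumed to be in topological order
        if sink_reached:
            continue                     # delete: every site has applied it
        new_partition_log.append(entry)  # retain, original partition-ID intact
    return new_partition_log


entries = [LogEntry("T1", "P2"), LogEntry("T2", "P3")]
retained = step5(entries, sink_reached=False)
cleared = step5(entries, sink_reached=True)
```

In a partial merge the entries survive with their original
partition-IDs, so a later merge in another subgraph can still
use them; at the sink node nothing is retained.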


As we can see, this solution requires that partial merges be
done in a symmetric way. Obviously this will delay some
partial merges, since we have to wait until partitions in the
same subgraph can communicate with each other. As a
consequence there will be more to reconcile afterwards, since
partitions will continue operating in partitioned mode in the
meantime. However, this solution is a compromise between the
two solutions that were mentioned first, and we think it is a
reasonable one.

3. Allowing External Actions

One of the assumptions made in section B was that no
irrecoverable external actions were allowed and that
transactions which attempted to execute one of these actions
were aborted. In this subsection we analyze under which
circumstances irrecoverable external actions can be allowed.

Faissol [8] proposed a solution to this problem (see Chapter
3, section C) by determining those integrity assertions that
could be partitioned in such a way that, if they were not
violated in any partition, then they would not be violated as
a whole when the network is completely connected. Thus
irrecoverable external actions that involve objects with this
type of integrity assertion could be allowed under network
partition. This solution can also be implemented in the
approach presented in this chapter by giving the DBMS enough
semantic information about integrity assertions dealing with
external actions.

An alternate solution would be to allow irrecoverable
external actions in at most one partition. In fact, there is
no reason why we could not do this. In our approach the
integrity assertions for each partition are the same as the
ones that hold when the network is completely connected. The
partition in which this kind of action is allowed could be
determined by one of the methods proposed in Chapter 3,
section B (e.g. voting). Precedence should be given to
partitions where it is more likely that external actions will
occur. For example, if after a partition in a bank system 80%
of the automatic teller machines are in one partition, then
this partition should be allowed to execute irrecoverable
external actions (i.e. cash dispensing). This solution has the
advantages that it does not require extra integrity
assertions and that users who can access the selected
partition will have no restriction at all with respect to
external actions.
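
One way to guarantee that at most one partition qualifies,
without any communication between partitions, is a weighted
majority rule. The sketch below is an assumption made for
illustration and is not prescribed by the thesis: every site
is given a fixed weight in advance (for instance the number of
teller machines attached to it), and a partition may execute
irrecoverable external actions only if it holds a strict
majority of the total weight. The function name and the
weights are hypothetical.

```python
def may_execute_external_actions(sites_in_partition, weights):
    """True if this partition holds a strict majority of the total weight."""
    total = sum(weights.values())
    local = sum(weights[site] for site in sites_in_partition)
    # A strict majority can be held by at most one partition at a time.
    return 2 * local > total


# Hypothetical weights: share of teller machines attached to each site.
weights = {"N1": 50, "N2": 30, "N3": 20}
large = may_execute_external_actions({"N1", "N2"}, weights)  # holds 80 of 100
small = may_execute_external_actions({"N3"}, weights)        # holds 20 of 100
```

With an exact half-and-half split no partition qualifies,
which is the price paid for never selecting two partitions.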

However, we have to adopt some special measures to prevent
transactions that execute irrecoverable external actions from
being undone at partition merge. A way to assure this is to
"mark" those transactions as "permanent" in the partition-log,
so that even if such a transaction is involved in a cycle and
is the transaction with the fewest descendants, it will not be
chosen to be rolled-back. Another precaution should be taken
so that irrecoverable external actions are not executed again
at partition merge. This is achieved directly by the proposed
approach, because transactions are not executed in the other
partition; only the values of the modified objects are
forwarded.
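
The effect of the "permanent" mark on the choice of the
transaction to roll-back can be sketched as follows. This is
an illustration under stated assumptions: the cycle is given
as a list of transaction identifiers, the number of
descendants of each transaction is already known, and ties are
broken by identifier (a choice made here only to keep the
result deterministic). The name choose_victim is hypothetical.

```python
def choose_victim(cycle, descendants, permanent):
    """Pick the transaction in the cycle to roll-back.

    Transactions marked permanent are never chosen; among the rest the
    one with the fewest descendants is selected.
    """
    candidates = [t for t in cycle if t not in permanent]
    if not candidates:
        return None  # nothing may be rolled-back; the caller must handle this
    return min(candidates, key=lambda t: (descendants[t], t))


cycle = ["T1", "T2", "T3"]
descendants = {"T1": 0, "T2": 2, "T3": 1}
victim = choose_victim(cycle, descendants, permanent={"T1"})
```

Here T1 has the fewest descendants but executed an
irrecoverable external action, so T3 is selected instead.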
V. CONCLUSIONS AND SUGGESTIONS FOR FURTHER RESEARCH

In this thesis we analyzed the problem presented by network
partitions in distributed database systems with replicated
data. Because of the need to preserve mutual consistency and
to avoid irrecoverable external actions which may be performed
by transactions operating on inconsistent data, it is
impossible to allow unrestricted operation under network
partitions. Consistency and availability appear to be
fundamentally incompatible goals.

Existing solutions that allow one operating partition totally
block update transactions in all but one partition. In this
way mutual consistency is guaranteed when databases from
different partitions are merged. However, these solutions are
not acceptable for many existing military and commercial
applications which require high availability. For example, an
airline reservation system will prefer to conditionally
reserve a seat on a flight for a customer rather than telling
him to wait until the partition is repaired.

The two recently proposed solutions which allow multiple
operating partitions greatly increase the availability of
distributed systems, allowing them to utilize more fully the
potential improvement provided by redundant data. The method
consisting of the version vector mechanism together with the
log filter is very simple in structure and involves the
addition of only a few new constructs to the file system
design, namely, file origin points, version vectors, and the
log filter. Thus this approach requires very little system
overhead. However, the approach is especially suited to an
environment in which file updates are moderate and conflicts
occur only rarely, and thus it will probably not be very
useful in the kind of environment characterizing a database
system with high transaction rates and volatility.
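
For reference, the comparison at the heart of the version
vector mechanism can be sketched as below. The sketch assumes
the usual formulation, in which a vector records for each site
the number of updates to the file that originated there, and
two copies conflict when neither vector dominates the other.
The function name and the result strings are hypothetical.

```python
def compare_version_vectors(v1, v2):
    """Compare two version vectors given as dicts of site -> update count."""
    sites = set(v1) | set(v2)
    first_ge = all(v1.get(s, 0) >= v2.get(s, 0) for s in sites)
    second_ge = all(v2.get(s, 0) >= v1.get(s, 0) for s in sites)
    if first_ge and second_ge:
        return "equal"
    if first_ge:
        return "first dominates"
    if second_ge:
        return "second dominates"
    return "conflict"  # independent updates were made in different partitions


ordered = compare_version_vectors({"A": 2, "B": 1}, {"A": 1, "B": 1})
diverged = compare_version_vectors({"A": 2, "B": 0}, {"A": 1, "B": 1})
```

The first pair can be reconciled by taking the dominating
copy; the second pair is a genuine conflict, the case that
becomes frequent under high update rates.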

The approach involving semantic knowledge is based on the
addition of semantic constraints to those already existing
within a particular application. These constraints are
enforced by the DBMS through the use of strong data types and
integrity assertions. The use of semantic information about
data assures that conflict detection and database
reconciliation can be performed when the network is
reconnected. This is perhaps the best existing solution to the
network partition problem, since it allows the highest degree
of availability through the use of different classes of
semantics of operations. In most cases the database can
therefore be reconciled without having to roll-back
transactions which were executed in partitioned mode in order
to achieve mutual consistency, and thus the user can feel
confident that even in the event of a network partition his
transactions will be executed giving reliable results.
However, this approach lacks generality, since the use of data
types restricts access to the database to a limited set of
operations. Also, for each different application, semantic
information about the operations used in it must be given to
the DBMS in order to correctly reconcile the database. Clearly
the overhead incurred by the reconciliation algorithms and the
extra information required will increase with the number of
applications processed by a system.

The approach proposed in this thesis assumes no semantic
knowledge about the data and thus is more general, since no
predefined operations are required and supporting several
applications does not increase the amount of information
necessary for the reconciliation algorithm. It can be argued
that a serious disadvantage of this method is that mutual
consistency is achieved by rolling-back conflicting
transactions and then re-executing them, and thus final
results may be different from the ones obtained by the users
when the network was partitioned. However, rolling-back
transactions should not be the rule but the exception, since
in a large class of applications most transactions will never
interfere with each other [5], [6]. Also, in the uncommon case
in which a conflict is detected, the reconciliation algorithm
will roll-back transactions only as a last resort, when
copies of the same logical data-object in different
partitions have been independently updated. Even in this case
the algorithm attempts to minimize the number of transactions
that need to be rolled-back in one partition by choosing the
transaction with the fewest descendants.

Further research is needed in this area in order to determine
which is the best method for dealing with network partitions
for different applications. Perhaps a combination of semantic
knowledge with the approach presented in this thesis will be
the most appropriate for some applications. For example, in
many commercial and military applications class A semantics
is the most frequent [8], and since it is the class of
semantics with the least associated overhead, it can be used
together with the alternate approach presented here in order
to reduce the amount of semantic information required by the
DBMS, thus reducing the overhead.


LIST  OF  REFERENCES 


1.  Alsberg, P.A. & Day, J.D., "A Principle for Resilient
    Sharing of Distributed Resources", Proceedings of the
    2nd International Conference on Software Engineering,
    13-15 October 1976.

2.  Bayer, R., "On the Integrity of Database and Resource
    Locking", Lecture Notes in Computer Science: Data Base
    Systems, edited by Goos, G. and Hartmanis, J.,
    Springer-Verlag, 1975, pp. 339-361.

3.  Badal, D.Z. and Popek, G.J., "A Proposal for Distributed
    Concurrency Control for Partially Replicated Distributed
    Databases", Proceedings of the Third Berkeley Workshop
    on Distributed Data Management and Computer Networks,
    August 1978.

4.  Badal, D.Z., "Concurrency Control Overhead or Closer Look
    at Blocking vs. Nonblocking Concurrency Control
    Mechanisms", Proceedings of the Fifth Berkeley Workshop
    on Distributed Data Management and Computer Networks,
    San Francisco, February 1981.

5.  Bernstein, P.A., Shipman, D.W., Rothnie, J.B.,
    "Concurrency Control in a System for Distributed
    Databases (SDD-1)", ACM Transactions on Database
    Systems, March 1980, pp. 18-51.

6.  Bernstein, P.A., Shipman, D.W., "The Correctness of
    Concurrency Control Mechanisms in a System for
    Distributed Databases (SDD-1)", ACM Transactions on
    Database Systems, March 1980, pp. 52-68.

7.  Eswaran, K.P., J.N. Gray, R.A. Lorie, I.L. Traiger, "The
    Notions of Consistency and Predicate Locking in a
    Database System", Communications of the ACM, Vol. 19,
    no. 11, November 1976.

8.  Faissol, S.Z., Operation of Distributed Database Systems
    Under Network Partitions, Ph.D. Thesis Dissertation,
    Computer Science Department, University of California,
    Los Angeles, July 1981.

9.  Gifford, D.K., "Weighted Voting for Replicated Data",
    Symposium on Operating Systems Principles, December
    1979, pp. 150-162.

10. Gray, J.N., R.A. Lorie, G.R. Putzolu, I.L. Traiger,
    "Granularity of Locks and Degrees of Consistency in a
    Shared Data Base", Modelling in Data Base Management
    Systems, G.M. Nijssen (editor), North Holland Publishing
    Company, 1976.

11. Gray, J.N., "Notes on Data Base Operating Systems",
    Operating Systems: an Advanced Course, Springer-Verlag,
    Berlin, 1978.

12. Hammer, M. & D. Shipman, "An Overview of Reliability
    Mechanisms for a Distributed Data Base System", Spring
    COMPCON 78, San Francisco, February 1978, pp. 63-65.

13. Horowitz, E. and Sahni, S., Fundamentals of Data
    Structures, Computer Science Press Inc., 1976.

14. Kung, H.T., J.T. Robinson, "On Optimistic Methods for
    Concurrency Control", ACM Transactions on Database
    Systems, June 1981, pp. 213-226.

15. Menasce, D.A., G.J. Popek, R.R. Muntz, "A Locking
    Protocol for Resource Coordination in Distributed
    Systems", ACM Transactions on Database Systems, June
    1980, pp. 103-138.

16. Parker, D.S., Popek, et al., "Detection of Mutual
    Inconsistency in Distributed Systems", Proceedings of
    the Fifth Berkeley Workshop on Distributed Data
    Management and Computer Networks, San Francisco,
    February 1981, pp. 172-184.

17. Parker, D.S., R.A. Ramos, "A Distributed File System
    Architecture Supporting High Availability", Proceedings
    of the Sixth Berkeley Workshop on Distributed Data
    Management and Computer Networks, Asilomar, February
    15-19, 1982, pp. 161-183.

18. Thomas, R.H., "A Solution to the Concurrency Control
    Problem for Multiple Copy Databases", Spring COMPCON 78,
    San Francisco, February 1978, pp. 56-62.

19. Rothnie, J.B. & N. Goodman, "A Survey of Research and
    Development in Distributed Database Management",
    Proceedings of the Third Conference on Very Large
    Databases, Tokyo, October 1977, pp. 48-61.

20. Ullman, J.D., Principles of Database Systems, Computer
    Science Press Inc., 1980.




INITIAL DISTRIBUTION LIST


                                                    No. Copies

1.  Defense Technical Information Center                 2
    Cameron Station
    Alexandria, Virginia 22314

2.  Defense Logistics Studies Information Exchange       2
    U.S. Army Logistics Management Center
    Fort Lee, Virginia 23831

3.  Library, Code 0142                                   2
    Naval Postgraduate School
    Monterey, California 93940

4.  Department Chairman, Code 52                         2
    Department of Computer Science
    Naval Postgraduate School
    Monterey, California 93940

5.  Department Chairman, Code 54                         1
    Department of Administrative Sciences
    Naval Postgraduate School
    Monterey, California 93940

6.  Dr. Norman B. Lyons, Code 54LB                       2
    Department of Administrative Sciences
    Naval Postgraduate School
    Monterey, California 93940

7.  Dr. Dusan Z. Badal, Code 52ZD                        1
    Department of Computer Science
    Naval Postgraduate School
    Monterey, California 93940

8.  Dr. Norman F. Schneidewind, Code 54SS                1
    Department of Administrative Sciences
    Naval Postgraduate School
    Monterey, California 93940

9.  Captain Peter L. Jones                               1
    Marine Corps Central Design
    and Programming Activity
    Marine Corps Development and Education Center
    Quantico, Virginia 22134

10. LT. Ricardo Arana C.
    Ministerio de Marina
    Central de Procesamiento de Datos
    Av. Salaverry s/n
    Lima-Peru

11. LT. Javier De La Cuba B.
    Ministerio de Marina
    Direccion de Abastecimiento Naval
    Base Naval del Callao
    Lima-Peru

12. LT. Eduardo Bresani T.
    Ministerio de Marina
    Central de Procesamiento de Datos
    Av. Salaverry s/n
    Lima-Peru

13. LTJG. Saha Futaci
    Ataturk Caddesi Petek
    Apartmani Kat 3 Daire 9
    Bursa Turkey



